install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("flextable")
install.packages("purrr")
install.packages("checkdown")
Martin Schweinberger
January 1, 2026


This tutorial introduces the programming side of R: how to write code that makes decisions, repeats itself, and encapsulates reusable logic. These are the tools that transform R from an interactive calculator into a genuine programming environment — the tools you reach for when you want to automate a task, process many files at once, or build a custom analysis pipeline.
The tutorial uses linguistic examples throughout: text cleaning, corpus processing, token counting, and data wrangling tasks typical of language research. By the end, you will be able to write your own functions, process data in loops, and apply the same operation efficiently across many groups or files.
By the end of this tutorial you will be able to:
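1. Use if/else, ifelse(), and dplyr::case_when() to make data-driven decisions in your code
2. Write for loops that iterate over vectors, lists, and files, with properly pre-allocated output
3. Write while loops for condition-driven iteration
4. Define your own R functions with required and optional arguments, default values, and meaningful return values
5. Explain variable scope and why it matters for writing reliable functions
6. Use sapply(), lapply(), and apply() to replace common loop patterns
7. Use purrr::map() and its typed variants as a modern alternative to the apply family
8. Handle errors and warnings gracefully using tryCatch()
9. Apply best-practice principles: DRY code, meaningful names, input validation, and documentation
Before working through this tutorial, please complete:
- Getting Started with R and RStudio
- Loading, Saving, and Generating Data in R
- Handling Tables in R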
You should be comfortable with objects, vectors, data frames, and basic dplyr operations before continuing.
Martin Schweinberger. 2026. Working with R: Control Flow, Functions, and Programming. The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia. url: https://ladal.edu.au/tutorials/workingwithr/workingwithr.html (Version 2026.03.27).
Install required packages once:
Load packages and set options for this session:
library(dplyr) # data manipulation
library(ggplot2) # data visualisation
library(tidyr) # data reshaping
library(flextable) # formatted tables
library(purrr) # functional programming tools
library(checkdown) # interactive exercises
options(stringsAsFactors = FALSE)
options(scipen = 100)
options(max.print = 100)
set.seed(42)
We work with a small simulated corpus throughout the tutorial: text samples with register and metadata.
corpus <- data.frame(
doc_id = paste0("doc", 1:12),
register = rep(c("Academic", "News", "Fiction"), each = 4),
text = c(
"The syntactic properties of embedded clauses remain poorly understood.",
"Phonological alternations in unstressed syllables exhibit considerable variation.",
"Discourse coherence is maintained through a variety of cohesive devices.",
"The morphological complexity of agglutinative languages poses theoretical challenges.",
"Scientists announced a major breakthrough in renewable energy storage yesterday.",
"Local authorities confirmed that road closures will affect the city centre this weekend.",
"The prime minister addressed parliament amid growing calls for electoral reform.",
"Unemployment figures fell sharply in the third quarter according to new statistics.",
"She had not expected the letter to arrive so soon, or to contain such news.",
"The old house creaked and groaned as the storm gathered strength outside.",
"He said nothing for a long time, watching the rain trace patterns on the glass.",
"By morning the fog had lifted and the valley lay green and still below them."
),
n_tokens = c(11, 10, 12, 11, 14, 16, 13, 14, 17, 15, 18, 16),
year = c(2019, 2020, 2021, 2022, 2019, 2020, 2021, 2022,
2019, 2020, 2021, 2022),
stringsAsFactors = FALSE
)
What you will learn: How to make R take different actions depending on the data, which is the foundation of any decision-making code.
Key functions: if, else, else if, ifelse(), dplyr::case_when(), switch()
Why it matters: Real data is messy and varied. Conditional logic lets your code respond intelligently to what it finds rather than assuming all inputs look the same.
if / else statements
An if statement runs a block of code only when a condition is TRUE. The optional else block runs when it is FALSE.
Corpus is large enough for analysis: 12 documents.
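The message above is the kind of output a simple if/else check produces. A minimal sketch, assuming a hypothetical minimum of 10 documents (the threshold and the names n_docs and min_docs are illustrative and not taken from the original code):

```r
# Number of documents in the tutorial corpus (12 rows)
n_docs <- 12
# Hypothetical minimum corpus size; the original threshold is not shown
min_docs <- 10

if (n_docs >= min_docs) {
  msg <- paste0("Corpus is large enough for analysis: ", n_docs, " documents.")
} else {
  msg <- paste0("Corpus is too small: only ", n_docs, " documents.")
}
cat(msg, "\n")
```

Only one of the two branches ever runs, so msg is always assigned exactly once.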
Chain multiple conditions with else if:
Average token count: 13.9 — Complexity: moderate
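An else if chain of this kind might look like the sketch below. The cut-offs of 12 and 16 tokens are assumptions chosen for illustration; only the token counts come from the tutorial corpus.

```r
# Token counts from the tutorial corpus
n_tokens <- c(11, 10, 12, 11, 14, 16, 13, 14, 17, 15, 18, 16)
avg_tokens <- mean(n_tokens)

# Hypothetical cut-offs; conditions are checked in order and the first TRUE wins
if (avg_tokens < 12) {
  complexity <- "low"
} else if (avg_tokens < 16) {
  complexity <- "moderate"
} else {
  complexity <- "high"
}
cat("Average token count:", round(avg_tokens, 1), "- Complexity:", complexity, "\n")
```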
if requires a single TRUE or FALSE
The condition inside if() must evaluate to exactly one logical value. In R 4.2 and later, passing a vector of logicals is a hard error (in older versions it was a warning that used only the first element). Either way, it is never what you want.
Error: the condition has length > 1
Use any() or all() when you need to reduce a logical vector to a single value:
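A short sketch of both reductions, using the token counts from the tutorial corpus (the thresholds and variable names are illustrative):

```r
n_tokens <- c(11, 10, 12, 11, 14, 16, 13, 14, 17, 15, 18, 16)

# any(): TRUE if at least one element satisfies the condition
has_long_doc <- any(n_tokens > 15)
# all(): TRUE only if every element satisfies the condition
all_nonempty <- all(n_tokens > 0)

if (has_long_doc) cat("At least one document has more than 15 tokens.\n")
if (all_nonempty) cat("No document is empty.\n")
```

Both any() and all() return a single logical value, which is what if() requires.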
ifelse(): vectorised conditional
ifelse() applies a condition to an entire vector and returns a vector of results, one value per element. This makes it ideal for creating or recoding columns inside dplyr::mutate():
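As a minimal base R sketch, assuming a hypothetical cut-off of 12 tokens for what counts as a long document:

```r
n_tokens <- c(11, 10, 12, 11, 14, 16)

# One result per element: "long" where the test is TRUE, "short" otherwise
length_class <- ifelse(n_tokens > 12, "long", "short")
length_class
```

Inside a dplyr pipeline the same expression would sit in a mutate() call, for example mutate(length_class = ifelse(n_tokens > 12, "long", "short")).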
dplyr::case_when(): multiple conditions
When you need more than two categories, case_when() is far cleaner than nested ifelse() calls. It works like a series of if/else if conditions evaluated top to bottom, and the first matching condition wins:
Early Middle Recent
3 6 3
case_when() evaluation order
Conditions are tested top to bottom and the first match wins. Always put more specific conditions before less specific ones. The final TRUE ~ "value" acts as a catch-all default — it is good practice to always include one, because unmatched rows otherwise become NA.
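The effect of condition order can be seen in a small sketch. The year boundaries below are assumptions that happen to reproduce the Early/Middle/Recent split, not necessarily the original code; era_wrong shows what happens when the broader condition comes first.

```r
library(dplyr)

year <- c(2019, 2020, 2021, 2022)

# Specific condition first: 2019 is caught by the first line
era <- case_when(
  year <= 2019 ~ "Early",
  year <= 2021 ~ "Middle",
  TRUE         ~ "Recent"   # catch-all default
)

# Broader condition first: 2019 also satisfies year <= 2021,
# so the "Early" line can never match
era_wrong <- case_when(
  year <= 2021 ~ "Middle",
  year <= 2019 ~ "Early",
  TRUE         ~ "Recent"
)
```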
switch(): selecting among named options
switch() is useful when a single variable can take one of several known values and you want to map each to a different result:
describe_register <- function(reg) {
switch(reg,
"Academic" = "Formal; high lexical density; passive constructions common",
"News" = "Neutral; inverted pyramid structure; quotations frequent",
"Fiction" = "Varied; narrative voice; dialogue and description",
"Unknown register" # default
)
}
describe_register("Academic")
[1] "Formal; high lexical density; passive constructions common"
[1] "Unknown register"
Q1. What is the key difference between if and ifelse() in R?
Q2. In a case_when() call, what does the final TRUE ~ "Unknown" line do?
Q3. You want to add a column pos_class that is "function" when word is in c("the", "a", "of", "in") and "content" otherwise. Which code is correct?
for Loops
What you will learn: How to repeat a block of code for each element of a sequence or list.
Key concepts: Loop variable, iteration, pre-allocation, seq_along()
Why it matters: Loops automate repetitive tasks — processing multiple files, computing statistics per document, or building up results iteratively.
A for loop iterates over a sequence, executing its body once per element. The loop variable takes each element’s value in turn:
Academic : 4 documents
News : 4 documents
Fiction : 4 documents
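A loop producing counts like these could be sketched as follows. The register vector mirrors the corpus$register column; counts is an illustrative name.

```r
register <- rep(c("Academic", "News", "Fiction"), each = 4)
registers <- unique(register)

# Pre-allocated, named output vector
counts <- integer(length(registers))
names(counts) <- registers

for (reg in registers) {
  # reg takes the value "Academic", then "News", then "Fiction"
  counts[reg] <- sum(register == reg)
  cat(reg, ":", counts[reg], "documents\n")
}
```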
When you need both the element and its position, loop over indices using seq_along(). This is safer than 1:length(x) because it handles zero-length vectors correctly:
Word 1: syntax (6 characters)
Word 2: morphology (10 characters)
Word 3: phonology (9 characters)
Word 4: pragmatics (10 characters)
Word 5: semantics (9 characters)
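A sketch of an index-based loop, followed by the zero-length case that makes seq_along() the safer choice. The words vector is inferred from the output above; empty, bad_idx, and good_idx are illustrative names.

```r
words <- c("syntax", "morphology", "phonology", "pragmatics", "semantics")

for (i in seq_along(words)) {
  cat("Word ", i, ": ", words[i], " (", nchar(words[i]), " characters)\n", sep = "")
}

# With an empty vector, 1:length(x) counts down from 1 to 0
empty <- character(0)
bad_idx  <- 1:length(empty)   # two indices: 1 and 0
good_idx <- seq_along(empty)  # no indices, so a loop body never runs
```

A loop over bad_idx would run twice on an empty vector and index elements that do not exist.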
The most important loop performance rule: pre-allocate your output object before the loop, then fill it by index. Growing a vector by appending inside a loop forces R to copy the entire vector on every iteration — catastrophically slow for large inputs:
# Slow: growing inside the loop copies the vector on every iteration
results_slow <- c()
for (i in seq_along(words)) {
results_slow <- c(results_slow, nchar(words[i]))
}
# Fast: pre-allocate, then fill by index
results_fast <- integer(length(words))
for (i in seq_along(words)) {
results_fast[i] <- nchar(words[i])
}
results_fast
[1] 6 10 9 10 9
Here we loop over registers, compute summary statistics for each, and collect results in a pre-allocated list:
registers <- unique(corpus$register)
summaries <- vector("list", length(registers))
names(summaries) <- registers
for (reg in registers) {
subset_df <- corpus[corpus$register == reg, ]
summaries[[reg]] <- data.frame(
register = reg,
n_docs = nrow(subset_df),
mean_tok = round(mean(subset_df$n_tokens), 1),
sd_tok = round(sd(subset_df$n_tokens), 2),
min_tok = min(subset_df$n_tokens),
max_tok = max(subset_df$n_tokens)
)
}
do.call(rbind, summaries) |>
flextable() |>
flextable::set_table_properties(width = .85, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 12) |>
flextable::fontsize(size = 12, part = "header") |>
flextable::align_text_col(align = "center") |>
flextable::set_caption(caption = "Token statistics per register computed with a for loop.") |>
flextable::border_outer()
| register | n_docs | mean_tok | sd_tok | min_tok | max_tok |
|---|---|---|---|---|---|
| Academic | 4 | 11.0 | 0.82 | 10 | 12 |
| News | 4 | 14.2 | 1.26 | 13 | 16 |
| Fiction | 4 | 16.5 | 1.29 | 15 | 18 |
One of the most practical uses of for loops in corpus linguistics is processing many text files in a directory:
txt_files <- list.files(path = "data/corpus/",
pattern = "\\.txt$",
full.names = TRUE)
results <- data.frame(
filename = character(length(txt_files)),
n_chars = integer(length(txt_files)),
n_lines = integer(length(txt_files)),
stringsAsFactors = FALSE
)
for (i in seq_along(txt_files)) {
text <- readLines(txt_files[i], warn = FALSE)
results$filename[i] <- basename(txt_files[i])
results$n_chars[i] <- sum(nchar(text))
results$n_lines[i] <- length(text)
}
head(results)
break and next
Two special keywords control loop flow. break exits the loop immediately; next skips to the next iteration:
Long documents only:
doc5 - 14 tokens
doc6 - 16 tokens
doc7 - 13 tokens
doc8 - 14 tokens
doc9 - 17 tokens
doc10 - 15 tokens
doc11 - 18 tokens
doc12 - 16 tokens
First Academic document:
doc1 : The syntactic properties of embedded clauses remai ...
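A minimal sketch of both keywords, assuming a hypothetical cut-off of 13 tokens for long documents and searching for the first News document (both choices are illustrative, as are the names n_printed and first_news):

```r
doc_id   <- paste0("doc", 1:12)
register <- rep(c("Academic", "News", "Fiction"), each = 4)
n_tokens <- c(11, 10, 12, 11, 14, 16, 13, 14, 17, 15, 18, 16)

# next: skip short documents and carry on with the following iteration
n_printed <- 0
for (i in seq_along(doc_id)) {
  if (n_tokens[i] < 13) next
  cat(doc_id[i], "-", n_tokens[i], "tokens\n")
  n_printed <- n_printed + 1
}

# break: stop the loop entirely once the first match is found
first_news <- NA_character_
for (i in seq_along(doc_id)) {
  if (register[i] == "News") {
    first_news <- doc_id[i]
    break
  }
}
```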
Nested for loops
Loops can be nested, in which case the inner loop runs to completion for each iteration of the outer loop:
Documents per register x era:
Academic x Early : 1
Academic x Middle : 2
Academic x Recent : 1
News x Early : 1
News x Middle : 2
News x Recent : 1
Fiction x Early : 1
Fiction x Middle : 2
Fiction x Recent : 1
Before writing a loop, ask: does a vectorised function or dplyr verb already do this? Vectorised operations in R are implemented in C and run far faster than R-level loops.
[1] 70 81 72 85 80 88 80 83 75 73 79 76
# A tibble: 3 × 2
register mean_tok
<chr> <dbl>
1 Academic 11
2 Fiction 16.5
3 News 14.2
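To make the comparison concrete, the sketch below computes character counts once with a loop and once with the vectorised nchar(), then computes group means without a loop. It uses base R tapply() rather than dplyr so that it runs without extra packages; that substitution and the two example sentences are choices made for this sketch.

```r
texts <- c("The old house creaked.", "By morning the fog had lifted.")

# Loop version: pre-allocate, then fill by index
n_chars_loop <- integer(length(texts))
for (i in seq_along(texts)) {
  n_chars_loop[i] <- nchar(texts[i])
}

# Vectorised version: one call, same result
n_chars_vec <- nchar(texts)

# Group means without a loop
register <- rep(c("Academic", "News", "Fiction"), each = 4)
n_tokens <- c(11, 10, 12, 11, 14, 16, 13, 14, 17, 15, 18, 16)
group_means <- tapply(n_tokens, register, mean)
```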
Loops shine when each iteration depends on the result of the previous one, when you are reading or writing files, or when no vectorised alternative exists.
for Loops
Q1. Why should you pre-allocate your output vector before a for loop rather than growing it with c() inside the loop?
Q2. What does next do inside a for loop?
Q3. Why is seq_along(x) preferred over 1:length(x) when looping over a vector x?
while Loops
What you will learn: How to write loops that run until a condition changes rather than for a fixed number of iterations.
Key concepts: Loop condition, infinite loops, break as a safety exit
When to use: Convergence algorithms, reading data streams, retrying failed operations
A while loop runs its body as long as its condition remains TRUE. Use it when the number of iterations is not known in advance.
Reached 58 tokens after 5 documents.
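A while loop of this kind might be sketched as below, assuming a hypothetical target of 50 tokens (the target value and the names target, total, and i are illustrative):

```r
n_tokens <- c(11, 10, 12, 11, 14, 16, 13, 14, 17, 15, 18, 16)
target <- 50   # hypothetical threshold

total <- 0
i <- 0
# Keep adding documents until the target is reached or the data runs out
while (total < target && i < length(n_tokens)) {
  i <- i + 1
  total <- total + n_tokens[i]
}
cat("Reached", total, "tokens after", i, "documents.\n")
```

The second condition, i < length(n_tokens), prevents the loop from indexing past the end of the vector if the target is never reached.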
Here we simulate reading tokens from a stream until we hit a sentence boundary:
tokens <- c("The", "quick", "brown", "fox", "jumps", ".", "Over", "the", "lazy")
sentence <- character(0)
j <- 0
while (j < length(tokens)) {
j <- j + 1
current <- tokens[j]
sentence <- c(sentence, current)
if (grepl("\\.$", current)) break
}
cat("First sentence:", paste(sentence, collapse = " "), "\n")
First sentence: The quick brown fox jumps .
A while loop runs forever if its condition never becomes FALSE. Always ensure the loop body modifies the condition variable, and include a maximum iteration counter as a safety exit:
Converged to 0.9698 after 44 iterations.
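The safety-counter pattern can be sketched with any iterative calculation. The example below is not the computation that produced the output above; it uses the Babylonian method for the square root of 2 purely to illustrate a convergence test combined with a maximum iteration count. All names and values are assumptions.

```r
estimate  <- 1       # starting guess
tolerance <- 1e-8    # stop when successive estimates differ by less than this
max_iter  <- 100     # safety exit: never run more than 100 iterations
iter      <- 0
change    <- Inf

while (change > tolerance && iter < max_iter) {
  iter <- iter + 1
  new_estimate <- (estimate + 2 / estimate) / 2   # Babylonian update for sqrt(2)
  change <- abs(new_estimate - estimate)          # updates the condition variable
  estimate <- new_estimate
}

if (iter == max_iter) {
  warning("Stopped at max_iter without converging.")
} else {
  cat("Converged to", round(estimate, 4), "after", iter, "iterations.\n")
}
```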
If you accidentally create an infinite loop, press Escape in the Console, or click the Stop button (red square) in the Console toolbar. RStudio will interrupt the running code. If that fails, use Session → Interrupt R from the menu.
while Loops
Q1. When is a while loop more appropriate than a for loop?
Q2. What is the risk of writing while (TRUE) { ... } without a break statement inside the body?
What you will learn: How to write your own reusable functions — the single most important skill for writing clean, maintainable R code.
Key concepts: Function definition, arguments, default values, return values, scope, documentation
Why it matters: Functions eliminate copy-paste errors, make your intentions explicit, and make code testable and shareable. If you have written the same block of code more than twice, it should be a function.
[1] "Hello from computational linguistics!"
[1] "Hello from corpus linguistics!"
Arguments without a default are required — omitting them raises an error. Arguments with a default are optional and use their default when not supplied:
[1] 0.625
[1] 0.75
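A sketch of a function with one required and one optional argument. The name calc_ttr and the contents of sample_tokens are hypothetical; they were chosen so that the two calls return values like those shown above.

```r
# tokens is required; lowercase is optional and defaults to TRUE
calc_ttr <- function(tokens, lowercase = TRUE) {
  if (lowercase) tokens <- tolower(tokens)
  length(unique(tokens)) / length(tokens)
}

# Hypothetical example vector
sample_tokens <- c("The", "cat", "sat", "on", "the", "mat", "the", "cat")

ttr_default <- calc_ttr(sample_tokens)                     # uses lowercase = TRUE
ttr_cased   <- calc_ttr(sample_tokens, lowercase = FALSE)  # "The" and "the" differ

# Omitting the required argument raises an error; try() captures it here
missing_arg <- try(calc_ttr(), silent = TRUE)
```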
A function automatically returns its last evaluated expression. Use return() explicitly for early exits when input validation requires it:
safe_ttr <- function(tokens, lowercase = TRUE) {
if (length(tokens) == 0) {
warning("Empty token vector supplied — returning NA.")
return(NA_real_)
}
if (!is.character(tokens)) {
stop("tokens must be a character vector.")
}
if (lowercase) tokens <- tolower(tokens)
length(unique(tokens)) / length(tokens)
}
safe_ttr(character(0)) # triggers warning, returns NA
Warning in safe_ttr(character(0)): Empty token vector supplied — returning NA.
[1] NA
[1] 0.625
Functions can return only one object, but that object can be a named list containing as many results as needed:
corpus_stats <- function(tokens, lowercase = TRUE) {
if (lowercase) tokens <- tolower(tokens)
list(
n_tokens = length(tokens),
n_types = length(unique(tokens)),
ttr = round(length(unique(tokens)) / length(tokens), 3),
longest = tokens[which.max(nchar(tokens))]
)
}
result <- corpus_stats(sample_tokens)
result$ttr
[1] 0.625
[1] "the"
List of 4
$ n_tokens: int 8
$ n_types : int 5
$ ttr : num 0.625
$ longest : chr "the"
Variables created inside a function live only inside that function — they are invisible to the global environment and cannot accidentally overwrite your workspace objects:
[1] "hello world"
[1] FALSE
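A minimal sketch of function scope, using hypothetical names (build_message, local_msg) and assuming no object called local_msg already exists in your workspace:

```r
build_message <- function() {
  local_msg <- "hello world"   # exists only while the function runs
  local_msg                    # returned as the last evaluated expression
}

returned <- build_message()

# The variable created inside the function is not visible outside it
visible_outside <- exists("local_msg")
```

The value reaches the caller only because it is returned and assigned to returned; the name local_msg itself never leaves the function.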
<<- operator
If you need to modify a variable that lives outside a function from inside it (rare), use <<-. This operator searches the function's enclosing (parent) environments, not the call stack, for an existing variable of that name and modifies it there; if no such variable is found, it creates one in the global environment. However, this is considered bad practice in most data analysis code because it creates hidden side effects that make functions unpredictable. Prefer returning a value and assigning it explicitly.
Good functions should be documented so you and colleagues can understand them months later. The conventional format mirrors the roxygen2 package style:
#' Compute Type-Token Ratio
#'
#' @description
#' Calculates the type-token ratio (TTR) of a character vector of tokens.
#' TTR = number of unique word types / total number of tokens.
#'
#' @param tokens A character vector of tokens (words).
#' @param lowercase Logical. If TRUE (default), tokens are lowercased before
#' counting, so "The" and "the" count as the same type.
#'
#' @return A single numeric value between 0 and 1. Values closer to 1
#' indicate higher lexical diversity.
#'
#' @examples
#' ttr(c("the", "cat", "sat", "on", "the", "mat"))
#' ttr(c("The", "Cat", "sat"), lowercase = FALSE)
ttr <- function(tokens, lowercase = TRUE) {
if (length(tokens) == 0) return(NA_real_)
if (lowercase) tokens <- tolower(tokens)
length(unique(tokens)) / length(tokens)
}
Here is a realistic example: a family of small, focused functions composed into a pipeline:
normalise_text <- function(text) {
text <- tolower(trimws(text))
gsub("\\s+", " ", text)
}
remove_punct <- function(text) {
gsub("[[:punct:]]", "", text)
}
tokenise <- function(text) {
strsplit(text, "\\s+")[[1]]
}
remove_stopwords <- function(tokens,
stopwords = c("the","a","an","of","in","and","to","is")) {
tokens[!tokens %in% stopwords]
}
clean_and_tokenise <- function(text, stopwords = NULL) {
text <- normalise_text(text)
text <- remove_punct(text)
tokens <- tokenise(text)
if (!is.null(stopwords)) tokens <- remove_stopwords(tokens, stopwords)
tokens
}
# Apply to a single document
example_text <- "The syntactic properties of embedded clauses remain poorly understood."
clean_and_tokenise(example_text,
stopwords = c("the","a","an","of","in","and","to","is"))
[1] "syntactic" "properties" "embedded" "clauses" "remain"
[6] "poorly" "understood"
doc_id register n_tokens content_tokens
1 doc1 Academic 11 7
2 doc2 Academic 10 7
3 doc3 Academic 12 7
4 doc4 Academic 11 7
5 doc5 News 14 8
6 doc6 News 16 12
Q1. A function has no explicit return() statement. What does it return?
Q2. You write x <- 99 inside a function body. After calling the function, does x exist in the global environment?
Q3. Your function computes three things: n_tokens, n_types, and TTR. What is the best way to return all three?
apply Family
What you will learn: How to apply a function to every element of a vector or list without writing an explicit loop.
Key functions: sapply(), lapply(), apply()
Why it matters: The apply family is more concise than loops and expresses intent clearly — “apply this function to each element of this object.”
sapply(): simplified apply
sapply() applies a function to each element of a vector or list and simplifies the result to a vector or matrix if possible:
The syntactic properties of embedded clauses remain poorly understood.
70
Phonological alternations in unstressed syllables exhibit considerable variation.
81
Discourse coherence is maintained through a variety of cohesive devices.
72
The morphological complexity of agglutinative languages poses theoretical challenges.
85
Scientists announced a major breakthrough in renewable energy storage yesterday.
80
Local authorities confirmed that road closures will affect the city centre this weekend.
88
doc1 doc2 doc3
0.8333333 1.0000000 0.6666667
Use an anonymous function for more complex operations:
The syntactic properties of embedded clauses remain poorly understood.
7
Phonological alternations in unstressed syllables exhibit considerable variation.
7
Discourse coherence is maintained through a variety of cohesive devices.
7
The morphological complexity of agglutinative languages poses theoretical challenges.
7
Scientists announced a major breakthrough in renewable energy storage yesterday.
8
Local authorities confirmed that road closures will affect the city centre this weekend.
12
The prime minister addressed parliament amid growing calls for electoral reform.
9
Unemployment figures fell sharply in the third quarter according to new statistics.
8
She had not expected the letter to arrive so soon, or to contain such news.
7
The old house creaked and groaned as the storm gathered strength outside.
7
He said nothing for a long time, watching the rain trace patterns on the glass.
9
By morning the fog had lifted and the valley lay green and still below them.
7
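A compact sketch of sapply() with an anonymous function, using three short hypothetical token vectors (the docs list is invented for illustration):

```r
# Hypothetical tokenised documents
docs <- list(
  doc1 = c("the", "cat", "sat", "on", "the", "mat"),
  doc2 = c("a", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"),
  doc3 = c("to", "be", "or", "not", "to", "be")
)

# The anonymous function is applied to each element; names are preserved
ttr_values <- sapply(docs, function(tok) length(unique(tok)) / length(tok))
ttr_values
```

Because every call returns a single number, sapply() simplifies the result to a named numeric vector.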
lapply(): list apply
lapply() always returns a list, making it safer when results have different lengths or types:
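For instance, tokenising texts of different lengths gives results that cannot be simplified to a vector. A minimal sketch with three hypothetical texts:

```r
texts <- c(
  doc1 = "the cat sat on the mat",
  doc2 = "a quick brown fox",
  doc3 = "to be"
)

# Each element of the result is a character vector of a different length
token_list <- lapply(texts, function(x) strsplit(x, " ")[[1]])

# lengths() reports how many tokens each list element holds
token_counts <- lengths(token_list)
```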
apply(): matrix / data frame apply
apply() operates on matrices or data frames, applying a function across rows (MARGIN = 1) or columns (MARGIN = 2):
n_tokens content_tokens
13.91667 9.25000
doc1 doc2 doc3 doc4 doc5 doc6
18 17 19 18 22 28
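The two margins can be sketched on a small numeric matrix. The values below are the first six rows of the n_tokens and content_tokens columns shown earlier, so the column means differ from the full-corpus means above; counts is an illustrative name.

```r
counts <- cbind(
  n_tokens       = c(11, 10, 12, 11, 14, 16),
  content_tokens = c(7, 7, 7, 7, 8, 12)
)
rownames(counts) <- paste0("doc", 1:6)

col_means <- apply(counts, 2, mean)  # MARGIN = 2: one value per column
row_sums  <- apply(counts, 1, sum)   # MARGIN = 1: one value per row
```

For plain sums and means, colMeans() and rowSums() do the same job and are usually faster; apply() is the general tool for arbitrary functions.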
sapply() and lapply()
| Function | Input | Output | Use when |
|---|---|---|---|
| sapply() | vector or list | vector/matrix (simplified) or list if simplification fails | results are all the same type and length |
| lapply() | vector or list | always a list | results differ in length or type; you always want a list |
| apply() | matrix or data frame | vector or list | you want to summarise across rows or columns of a matrix |
purrr
What you will learn: How to use purrr::map() and its variants as a modern, consistent alternative to the apply family.
Key functions: map(), map_chr(), map_dbl(), map_df(), map2(), walk()
Why it matters: purrr functions have consistent, predictable behaviour and integrate cleanly with dplyr pipelines.
The purrr package provides a family of map() functions that replace the apply family with a more consistent interface. Every map() function takes a list or vector and applies a function to each element.
map() and type-specific variants
map() always returns a list. Type-specific variants guarantee a particular output type and fail informatively if the results do not match:
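A sketch of the main variants side by side, using three hypothetical token vectors (the docs list is invented for illustration):

```r
library(purrr)

# Hypothetical tokenised documents
docs <- list(
  doc1 = c("the", "cat", "sat", "on", "the", "mat"),
  doc2 = c("a", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"),
  doc3 = c("to", "be", "or", "not", "to", "be")
)

n_list    <- map(docs, length)       # list of token counts
n_int     <- map_int(docs, length)   # integer vector of token counts
ttr_dbl   <- map_dbl(docs, function(tok) length(unique(tok)) / length(tok))
first_two <- map_chr(docs, function(tok) paste(tok[1:2], collapse = " "))
```

If a function returned, say, a character string inside map_int(), purrr would stop with an error rather than silently coercing the result.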
$doc1
[1] 6
$doc2
[1] 9
$doc3
[1] 6
doc1 doc2 doc3
0.8333333 1.0000000 0.6666667
doc1 doc2 doc3
6 9 6
doc1 doc2 doc3
"the cat" "a quick" "to be"
map_df(): map to a data frame
map_df() applies a function that returns a data frame to each element and binds the results together:
doc n_tokens n_types ttr
1 doc1 6 5 0.833
2 doc2 9 9 1.000
3 doc3 6 4 0.667
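Since purrr 1.0.0, map_df() is marked as superseded in favour of map() followed by list_rbind(). The sketch below shows that newer pattern, assuming purrr 1.0.0 or later; the docs list and the helper doc_stats() are hypothetical.

```r
library(purrr)

# Hypothetical tokenised documents
docs <- list(
  doc1 = c("the", "cat", "sat", "on", "the", "mat"),
  doc2 = c("a", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"),
  doc3 = c("to", "be", "or", "not", "to", "be")
)

# Helper returning a one-row data frame per document
doc_stats <- function(tok) {
  data.frame(
    n_tokens = length(tok),
    n_types  = length(unique(tok)),
    ttr      = round(length(unique(tok)) / length(tok), 3)
  )
}

# map() builds a list of data frames; list_rbind() stacks them row-wise
# and stores the list names in a new "doc" column
stats_df <- docs |>
  map(doc_stats) |>
  list_rbind(names_to = "doc")
```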
map2(): map over two inputs simultaneously
map2() applies a function to corresponding elements of two vectors or lists:
text_A : TTR = 1
text_B : TTR = 1
text_C : TTR = 0.75
[1] 1.00 1.00 0.75
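A minimal sketch using the typed variant map2_dbl(), pairing each label with its token vector. The labels and token sets are hypothetical.

```r
library(purrr)

labels <- c("text_A", "text_B", "text_C")
token_sets <- list(
  c("syntax", "matters"),
  c("words", "carry", "meaning"),
  c("the", "cat", "the", "dog")
)

# The function receives one element from each input per call
ttrs <- map2_dbl(labels, token_sets, function(label, tokens) {
  ttr <- length(unique(tokens)) / length(tokens)
  cat(label, ": TTR =", ttr, "\n")
  ttr
})
```

Both inputs must have the same length (or one of them length 1); otherwise map2() stops with an error.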
walk(): map for side effects
walk() is like map() but used when you want the side effect (printing, writing a file, making a plot) rather than the return value. It invisibly returns the input, enabling piping:
Register: Academic | Docs: 4 | Mean tokens: 11.0
Register: Fiction | Docs: 4 | Mean tokens: 16.5
Register: News | Docs: 4 | Mean tokens: 14.2
apply and purrr
Q1. What is the difference between sapply() and lapply()?
Q2. When would you use purrr::walk() instead of purrr::map()?
What you will learn: How to write code that handles errors and warnings gracefully rather than crashing.
Key functions: tryCatch(), try(), stop(), warning(), message()
Why it matters: When processing many files or documents, a single error should not halt your entire pipeline.
Use stop(), warning(), and message() to communicate problems from inside your functions:
compute_ttr <- function(tokens) {
if (!is.character(tokens)) stop("tokens must be a character vector")
if (length(tokens) == 0) warning("Empty vector — returning NA")
if (length(tokens) < 10) message("Note: TTR is unreliable for short texts")
if (length(tokens) == 0) return(NA_real_)
length(unique(tokens)) / length(tokens)
}
compute_ttr(c("the", "cat", "sat")) # triggers message: short text
Note: TTR is unreliable for short texts
[1] 1
The three signals have different effects on execution: stop() halts immediately; warning() signals a problem but continues; message() prints an informational note and continues.
tryCatch(): handle errors gracefully
tryCatch() intercepts errors, warnings, and messages, letting you decide what to do instead of crashing:
safe_ttr <- function(tokens) {
tryCatch(
expr = compute_ttr(tokens),
error = function(e) {
cat("Error in compute_ttr:", conditionMessage(e), "\n")
NA_real_
},
warning = function(w) {
cat("Warning:", conditionMessage(w), "\n")
NA_real_
}
)
}
safe_ttr(c("the", "cat", "sat", "on", "the", "mat")) # normal
Note: TTR is unreliable for short texts
[1] 0.8333333
Error in compute_ttr: tokens must be a character vector
[1] NA
Warning: Empty vector — returning NA
[1] NA
tryCatch() across a pipeline
This pattern is invaluable when processing many documents, because one bad item should not stop the whole run:
Note: TTR is unreliable for short texts
Error in compute_ttr: tokens must be a character vector
Warning: Empty vector — returning NA
Note: TTR is unreliable for short texts
[1] 0.8333333 NA NA 1.0000000
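A self-contained sketch of this pattern. It redefines compact versions of compute_ttr() and safe_ttr() so that it runs on its own, and the inputs list is hypothetical: one valid vector, one non-character value, one empty vector, and one very short vector.

```r
compute_ttr <- function(tokens) {
  if (!is.character(tokens)) stop("tokens must be a character vector")
  if (length(tokens) == 0) warning("Empty vector, returning NA")
  if (length(tokens) < 10) message("Note: TTR is unreliable for short texts")
  if (length(tokens) == 0) return(NA_real_)
  length(unique(tokens)) / length(tokens)
}

# Wrapper: report the problem, return NA, and let the run continue
safe_ttr <- function(tokens) {
  tryCatch(
    compute_ttr(tokens),
    error   = function(e) { cat("Error:", conditionMessage(e), "\n"); NA_real_ },
    warning = function(w) { cat("Warning:", conditionMessage(w), "\n"); NA_real_ }
  )
}

# Hypothetical batch with two problematic items
inputs <- list(
  c("the", "cat", "sat", "on", "the", "mat"),
  42,
  character(0),
  c("a", "b")
)

# vapply() guarantees one numeric result per input
results <- vapply(inputs, safe_ttr, numeric(1))
results
```

The failed items appear as NA in results, so they can be located afterwards with which(is.na(results)) and inspected individually.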
Q1. What is the difference between stop(), warning(), and message() inside a function?
Q2. Why is wrapping a function call in tryCatch() useful when processing a large number of files or documents?
A concise guide to writing better R code: when to loop, when to vectorise, how to name and document functions, and the DRY principle.
| Situation | Best tool |
|---|---|
| Apply the same operation to every element of a vector | Vectorised operation (e.g. nchar(), tolower(), arithmetic) |
| Apply the same operation to each group in a data frame | dplyr::group_by() + summarise() or mutate() |
| Apply a function to each element and collect results | sapply() / lapply() / purrr::map() |
| Iterate when each step depends on the previous result | for loop with pre-allocated output |
| Number of iterations unknown; stop when condition met | while loop (with break safety exit) |
| Apply a function for its side effects (print, save, plot) | purrr::walk() or a for loop |
| Handle different cases of a single categorical variable | ifelse() / case_when() / switch() |
- A name like clean_tokenise_count_and_plot() is a sign it should be four functions
- Prefer descriptive names such as clean_text(), compute_ttr(), plot_frequency(), not myFunc() or data2()
- Call stop() at the top of the function body for invalid arguments
- Test edge cases, including missing values (NA) and invalid inputs
# Good: clear structure, consistent indentation, descriptive names
compute_register_stats <- function(data, group_col = "register") {
data |>
dplyr::group_by(.data[[group_col]]) |>
dplyr::summarise(
n = dplyr::n(),
mean_tok = round(mean(n_tokens), 1),
sd_tok = round(sd(n_tokens), 2),
.groups = "drop"
)
}
# Bad: cryptic names, no whitespace, no structure
f<-function(d,g="register"){d%>%group_by(.data[[g]])%>%summarise(n=n(),m=round(mean(n_tokens),1))}
Don’t Repeat Yourself. If you find yourself copy-pasting a block of code and changing one value, that block should be a function parameterised by that value. Code duplication multiplies the places you must update when requirements change and multiplies the opportunities for inconsistency.
# Before: copy-pasted three times with minor changes
academic_ttr <- ...
news_ttr <- ...
fiction_ttr <- ...
# After: one function, called three times
get_register_ttr <- function(data, reg) { ... }
sapply(c("Academic", "News", "Fiction"), get_register_ttr, data = corpus)
Martin Schweinberger. 2026. Working with R: Control Flow, Functions, and Programming. The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia. url: https://ladal.edu.au/tutorials/workingwithr/workingwithr.html (Version 2026.03.27), doi: 10.5281/zenodo.19332999.
@manual{martinschweinberger2026working,
author = {Martin Schweinberger},
title = {Working with R: Control Flow, Functions, and Programming},
year = {2026},
note = {https://ladal.edu.au/tutorials/workingwithr/workingwithr.html},
organization = {The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia},
edition = {2026.03.27},
doi = {10.5281/zenodo.19332999}
}
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: Australia/Brisbane
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] purrr_1.0.4 flextable_0.9.7 tidyr_1.3.2 ggplot2_4.0.2
[5] dplyr_1.2.0 checkdown_0.0.13
loaded via a namespace (and not attached):
[1] utf8_1.2.4 generics_0.1.3 fontLiberation_0.1.0
[4] renv_1.1.1 xml2_1.3.6 digest_0.6.39
[7] magrittr_2.0.3 evaluate_1.0.3 grid_4.4.2
[10] RColorBrewer_1.1-3 fastmap_1.2.0 jsonlite_1.9.0
[13] zip_2.3.2 scales_1.4.0 fontBitstreamVera_0.1.1
[16] codetools_0.2-20 textshaping_1.0.0 cli_3.6.4
[19] rlang_1.1.7 fontquiver_0.2.1 litedown_0.9
[22] commonmark_2.0.0 withr_3.0.2 yaml_2.3.10
[25] gdtools_0.4.1 tools_4.4.2 officer_0.6.7
[28] uuid_1.2-1 vctrs_0.7.1 R6_2.6.1
[31] lifecycle_1.0.5 htmlwidgets_1.6.4 ragg_1.3.3
[34] pkgconfig_2.0.3 pillar_1.10.1 gtable_0.3.6
[37] data.table_1.17.0 glue_1.8.0 Rcpp_1.0.14
[40] systemfonts_1.2.1 xfun_0.56 tibble_3.2.1
[43] tidyselect_1.2.1 rstudioapi_0.17.1 knitr_1.51
[46] farver_2.1.2 htmltools_0.5.9 rmarkdown_2.30
[49] compiler_4.4.2 S7_0.2.1 askpass_1.2.1
[52] markdown_2.0 openssl_2.3.2
This tutorial was re-developed with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to help revise the tutorial text, structure the instructional content, generate the R code examples, and write the checkdown quiz questions and feedback strings. All content was reviewed, edited, and approved by the author (Martin Schweinberger), who takes full responsibility for the accuracy and pedagogical appropriateness of the material. The use of AI assistance is disclosed here in the interest of transparency and in accordance with emerging best practices for AI-assisted academic content creation.
---
title: "Working with R: Control Flow, Functions, and Programming"
author: "Martin Schweinberger"
date: "2026"
params:
  title: "Working with R: Control Flow, Functions, and Programming"
  author: "Martin Schweinberger"
  year: "2026"
  version: "2026.03.27"
  url: "https://ladal.edu.au/tutorials/workingwithr/workingwithr.html"
  institution: "The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia"
  description: "This tutorial covers core R programming concepts including conditional logic, for and while loops, custom functions, the apply family of functions, functional programming with purrr, and error handling with tryCatch. It is aimed at researchers in linguistics and the humanities who want to move beyond basic R usage and write reusable, automated analysis pipelines."
  doi: "10.5281/zenodo.19332999"
format:
  html:
    toc: true
    toc-depth: 4
    code-fold: show
    code-tools: true
    theme: cosmo
---
```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
library(checkdown)
library(dplyr)
library(ggplot2)
library(tidyr)
library(flextable)
library(purrr)
options(stringsAsFactors = FALSE)
options(scipen = 100)
options(max.print = 100)
set.seed(42)
```
# Introduction {#intro}
This tutorial introduces the **programming** side of R: how to write code that makes decisions, repeats itself, and encapsulates reusable logic. These are the tools that transform R from an interactive calculator into a genuine programming environment — the tools you reach for when you want to automate a task, process many files at once, or build a custom analysis pipeline.
The tutorial uses linguistic examples throughout: text cleaning, corpus processing, token counting, and data wrangling tasks typical of language research. By the end, you will be able to write your own functions, process data in loops, and apply the same operation efficiently across many groups or files.
::: {.callout-note}
## Learning Objectives
By the end of this tutorial you will be able to:
1. Use `if`/`else`, `ifelse()`, and `dplyr::case_when()` to make data-driven decisions in your code
2. Write `for` loops that iterate over vectors, lists, and files, with properly pre-allocated output
3. Write `while` loops for condition-driven iteration
4. Define your own R functions with required and optional arguments, default values, and meaningful return values
5. Explain variable scope and why it matters for writing reliable functions
6. Use `sapply()`, `lapply()`, and `apply()` to replace common loop patterns
7. Use `purrr::map()` and its typed variants as a modern alternative to the `apply` family
8. Handle errors and warnings gracefully using `tryCatch()`
9. Apply best-practice principles: DRY code, meaningful names, input validation, and documentation
:::
::: {.callout-note}
## Prerequisite Tutorials
Before working through this tutorial, please complete:
- [Getting Started with R and RStudio](/tutorials/intror/intror.html)
- [Loading, Saving, and Generating Data in R](/tutorials/load/load.html)
- [Handling Tables in R](/tutorials/table/table.html)
You should be comfortable with objects, vectors, data frames, and basic `dplyr` operations before continuing.
:::
::: {.callout-note}
## Citation
```{r citation-callout-top, echo=FALSE, results='asis'}
cat(
params$author, ". ",
params$year, ". *",
params$title, "*. ",
params$institution, ". ",
"url: ", params$url, " ",
"(Version ", params$version, ").",
sep = ""
)
```
:::
---
## Preparation and Session Set-up {-}
Install required packages once:
```{r install, eval=FALSE}
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("flextable")
install.packages("purrr")
install.packages("checkdown")
```
Load packages and set options for this session:
```{r load, message=FALSE, warning=FALSE}
library(dplyr) # data manipulation
library(ggplot2) # data visualisation
library(tidyr) # data reshaping
library(flextable) # formatted tables
library(purrr) # functional programming tools
library(checkdown) # interactive exercises
options(stringsAsFactors = FALSE)
options(scipen = 100)
options(max.print = 100)
set.seed(42)
```
We work with a small simulated corpus throughout the tutorial: twelve short text samples, each annotated with its register, a token count, and a year:
```{r sample-data}
corpus <- data.frame(
doc_id = paste0("doc", 1:12),
register = rep(c("Academic", "News", "Fiction"), each = 4),
text = c(
"The syntactic properties of embedded clauses remain poorly understood.",
"Phonological alternations in unstressed syllables exhibit considerable variation.",
"Discourse coherence is maintained through a variety of cohesive devices.",
"The morphological complexity of agglutinative languages poses theoretical challenges.",
"Scientists announced a major breakthrough in renewable energy storage yesterday.",
"Local authorities confirmed that road closures will affect the city centre this weekend.",
"The prime minister addressed parliament amid growing calls for electoral reform.",
"Unemployment figures fell sharply in the third quarter according to new statistics.",
"She had not expected the letter to arrive so soon, or to contain such news.",
"The old house creaked and groaned as the storm gathered strength outside.",
"He said nothing for a long time, watching the rain trace patterns on the glass.",
"By morning the fog had lifted and the valley lay green and still below them."
),
n_tokens = c(11, 10, 12, 11, 14, 16, 13, 14, 17, 15, 18, 16),
year = c(2019, 2020, 2021, 2022, 2019, 2020, 2021, 2022,
2019, 2020, 2021, 2022),
stringsAsFactors = FALSE
)
```
---
# Conditional Logic {#conditionals}
::: {.callout-note}
## Section Overview
**What you will learn:** How to make R take different actions depending on the data — the foundation of any decision-making code.
**Key functions:** `if`, `else`, `else if`, `ifelse()`, `dplyr::case_when()`, `switch()`
**Why it matters:** Real data is messy and varied. Conditional logic lets your code respond intelligently to what it finds rather than assuming all inputs look the same.
:::
## `if` / `else` statements {-}
An `if` statement runs a block of code **only when a condition is `TRUE`**. The optional `else` block runs when it is `FALSE`.
```{r if-basic}
n_docs <- nrow(corpus)
if (n_docs >= 10) {
cat("Corpus is large enough for analysis:", n_docs, "documents.\n")
} else {
cat("Corpus may be too small:", n_docs, "documents.\n")
}
```
Chain multiple conditions with `else if`:
```{r if-elseif}
mean_tokens <- mean(corpus$n_tokens)
if (mean_tokens < 10) {
complexity <- "low"
} else if (mean_tokens < 15) {
complexity <- "moderate"
} else {
complexity <- "high"
}
cat("Average token count:", round(mean_tokens, 1),
"— Complexity:", complexity, "\n")
```
::: {.callout-important}
## `if` requires a single `TRUE` or `FALSE`
The condition inside `if()` must evaluate to exactly one logical value. In R 4.2 and later, passing a vector of logicals is a hard error (in older versions it was a warning that used only the first element). Either way, it is never what you want.
```{r if-vector-warn}
x <- c(TRUE, FALSE, TRUE)
tryCatch(
if (x) print("this only checks x[1]"),
error = function(e) cat("Error:", conditionMessage(e), "\n")
)
```
Use `any()` or `all()` when you need to reduce a logical vector to a single value:
```{r if-any-all}
if (any(x)) cat("At least one TRUE\n")
if (all(x)) cat("All TRUE\n") else cat("Not all TRUE\n")
```
:::
## `ifelse()` — vectorised conditional {-}
`ifelse()` applies a condition to an **entire vector** and returns a vector of results — one value per element. This makes it ideal for creating or recoding columns inside `dplyr::mutate()`:
```{r ifelse}
corpus <- corpus |>
dplyr::mutate(
length_class = ifelse(n_tokens >= 15, "long", "short")
)
table(corpus$length_class)
```
## `dplyr::case_when()` — multiple conditions {-}
When you need more than two categories, `case_when()` is far cleaner than nested `ifelse()` calls. It works like a series of `if`/`else if` conditions evaluated top to bottom — the first matching condition wins:
```{r case-when}
corpus <- corpus |>
dplyr::mutate(
era = dplyr::case_when(
year <= 2019 ~ "Early",
year <= 2021 ~ "Middle",
year == 2022 ~ "Recent",
TRUE ~ "Unknown" # catch-all default
)
)
table(corpus$era)
```
::: {.callout-tip}
## `case_when()` evaluation order
Conditions are tested top to bottom and the first match wins. Always put more specific conditions before less specific ones. The final `TRUE ~ "value"` acts as a catch-all default — it is good practice to always include one, because unmatched rows otherwise become `NA`.
:::
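The two points in the tip above can be checked on a small vector. The following is a minimal sketch using a made-up `years` vector (not part of the tutorial corpus): it shows an unmatched value becoming `NA`, the catch-all fixing that, and a broad condition placed first shadowing a narrower one.

```r
library(dplyr)

# Hypothetical toy data, not part of the tutorial corpus
years <- c(2018, 2020, 2022, 2025)

# Without a catch-all, 2025 matches no condition and becomes NA
no_default <- case_when(
  years <= 2019 ~ "Early",
  years <= 2021 ~ "Middle",
  years == 2022 ~ "Recent"
)

# With a catch-all, every element receives a label
with_default <- case_when(
  years <= 2019 ~ "Early",
  years <= 2021 ~ "Middle",
  years == 2022 ~ "Recent",
  TRUE          ~ "Unknown"
)

# Order matters: a broad condition placed first shadows the narrower one
wrong_order <- case_when(
  years <= 2021 ~ "Middle",
  years <= 2019 ~ "Early",   # never reached: 2018 already matched above
  TRUE          ~ "Other"
)

no_default
with_default
wrong_order
```

In `wrong_order`, 2018 is labelled `"Middle"` because the first condition already matched it, so the `"Early"` branch is never used.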
## `switch()` — selecting among named options {-}
`switch()` is useful when a single variable can take one of several known values and you want to map each to a different result:
```{r switch}
describe_register <- function(reg) {
switch(reg,
"Academic" = "Formal; high lexical density; passive constructions common",
"News" = "Neutral; inverted pyramid structure; quotations frequent",
"Fiction" = "Varied; narrative voice; dialogue and description",
"Unknown register" # default
)
}
describe_register("Academic")
describe_register("Blog") # no match, so the unnamed default is returned
```
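Two further behaviours of `switch()` are worth knowing when the input is a character string. The sketch below uses a hypothetical helper, `text_type()`, invented here for illustration: an empty alternative passes control to the next non-empty one, and a call with no default returns `NULL` invisibly when nothing matches.

```r
# Hypothetical helper: group registers into broader text types
text_type <- function(reg) {
  switch(reg,
    "Academic" = ,                  # empty: uses the value of the next alternative
    "News"     = "informational",
    "Fiction"  = "narrative",
    "other"                         # unnamed default
  )
}

text_type("Academic")
text_type("Fiction")
text_type("Blog")

# Without a default, an unmatched value returns NULL (invisibly)
no_default <- switch("Blog", "Academic" = "informational")
is.null(no_default)
```

Because the silent `NULL` is easy to miss, including an explicit default (or calling `stop()` there) is usually the safer design.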
---
::: {.callout-tip}
## Exercises: Conditional Logic
:::
**Q1. What is the key difference between `if` and `ifelse()` in R?**
```{r}
#| echo: false
#| label: "COND_Q1"
check_question("`if` evaluates a single TRUE/FALSE and runs a code block; `ifelse()` evaluates a vector of conditions and returns a vector of values",
options = c(
"`if` evaluates a single TRUE/FALSE and runs a code block; `ifelse()` evaluates a vector of conditions and returns a vector of values",
"They are identical — `ifelse()` is just a shorthand alias for `if`",
"`if` is faster than `ifelse()` for all inputs",
"`ifelse()` can only handle numeric conditions; `if` handles any type"
),
type = "radio",
q_id = "COND_Q1",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! `if(condition) { block }` expects exactly one TRUE or FALSE and determines which block of code to execute. `ifelse(condition, yes, no)` accepts a vector of logicals and returns a vector — for each TRUE it returns the corresponding `yes` value, for each FALSE the `no` value. This vectorised behaviour makes `ifelse()` ideal inside `dplyr::mutate()` for recoding entire columns.",
wrong = "Think about what each construct *returns* and what kind of input it accepts. One runs a block of code once; the other produces a vector of values.")
```
**Q2. In a `case_when()` call, what does the final `TRUE ~ "Unknown"` line do?**
```{r}
#| echo: false
#| label: "COND_Q2"
check_question("It acts as a catch-all default, matching any rows not matched by the earlier conditions",
options = c(
"It acts as a catch-all default, matching any rows not matched by the earlier conditions",
"It checks whether the variable equals the logical value TRUE",
"It causes an error — TRUE is not a valid left-hand side in case_when()",
"It replaces all values with 'Unknown', overriding the earlier conditions"
),
type = "radio",
q_id = "COND_Q2",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! In `case_when()`, conditions are tested top to bottom and the first match wins. `TRUE` always evaluates to `TRUE`, so `TRUE ~ \"Unknown\"` matches every row that has not already been matched by an earlier condition. It is the equivalent of the final `else` in a chain of `if`/`else if` statements. Omitting it means unmatched rows become `NA`.",
wrong = "Remember that `case_when()` evaluates conditions from top to bottom. A condition that is always `TRUE` placed at the end will only ever be reached by rows that did not match anything above it.")
```
**Q3. You want to add a column `pos_class` that is `"function"` when `word` is in `c("the", "a", "of", "in")` and `"content"` otherwise. Which code is correct?**
```{r}
#| echo: false
#| label: "COND_Q3"
check_question('df %>% mutate(pos_class = ifelse(word %in% c("the","a","of","in"), "function", "content"))',
options = c(
'df %>% mutate(pos_class = ifelse(word %in% c("the","a","of","in"), "function", "content"))',
'df %>% mutate(pos_class = if(word %in% c("the","a","of","in")) "function" else "content")',
'df %>% filter(word %in% c("the","a","of","in")) %>% mutate(pos_class = "function")',
'df %>% mutate(pos_class = case_when(word == "the" ~ "function", word == "a" ~ "function"))'
),
type = "radio",
q_id = "COND_Q3",
random_answer_order = TRUE,
button_label = "Check answer",
right = 'Correct! `%in%` tests membership in a vector, and `ifelse()` applies this test to every row of the column, returning "function" or "content" accordingly. The `if`/`else` option would fail because `if` cannot handle vectors. The `filter()` option would *remove* non-function words rather than labelling them. The `case_when()` option would work only for "the" and "a" — the remaining words would become `NA` without a catch-all.',
wrong = 'Think about which function handles vectorised conditions (needed for a whole column) and which operator tests whether a value belongs to a set.')
```
---
# `for` Loops {#forloops}
::: {.callout-note}
## Section Overview
**What you will learn:** How to repeat a block of code for each element of a sequence or list.
**Key concepts:** Loop variable, iteration, pre-allocation, `seq_along()`
**Why it matters:** Loops automate repetitive tasks — processing multiple files, computing statistics per document, or building up results iteratively.
:::
A `for` loop **iterates** over a sequence, executing its body once per element. The loop variable takes each element's value in turn:
```{r for-basic}
registers <- unique(corpus$register)
for (reg in registers) {
n <- sum(corpus$register == reg)
cat(reg, ":", n, "documents\n")
}
```
## Looping with indices {-}
When you need both the element and its position, loop over indices using `seq_along()`. This is safer than `1:length(x)` because it handles zero-length vectors correctly:
```{r for-index}
words <- c("syntax", "morphology", "phonology", "pragmatics", "semantics")
for (i in seq_along(words)) {
cat(sprintf("Word %d: %-12s (%d characters)\n",
i, words[i], nchar(words[i])))
}
```
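The zero-length case mentioned above is easy to verify directly. This small sketch counts how many times each loop body runs when the input vector is empty; the variable names are chosen here for illustration only.

```r
empty <- character(0)

1:length(empty)     # counts down from 1 to 0
seq_along(empty)    # empty integer vector

# Count how often each loop body runs
runs_colon <- 0
for (i in 1:length(empty)) runs_colon <- runs_colon + 1

runs_seq <- 0
for (i in seq_along(empty)) runs_seq <- runs_seq + 1

c(colon = runs_colon, seq_along = runs_seq)
```

The `1:length()` loop runs twice on an empty input, while the `seq_along()` loop does not run at all.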
## Storing results: pre-allocation {-}
The most important loop performance rule: **pre-allocate** your output object before the loop, then fill it by index. Growing a vector by appending inside a loop forces R to copy the entire vector on every iteration — catastrophically slow for large inputs:
```{r prealloc}
# Slow: growing inside the loop copies the vector on every iteration
results_slow <- c()
for (i in seq_along(words)) {
results_slow <- c(results_slow, nchar(words[i]))
}
# Fast: pre-allocate, then fill by index
results_fast <- integer(length(words))
for (i in seq_along(words)) {
results_fast[i] <- nchar(words[i])
}
results_fast
```
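With only five words the difference is invisible, so it can help to time both patterns on a larger input. The following is a rough sketch: the size `n` and the helper names `grow()` and `prealloc()` are arbitrary choices made here, and the exact timings will vary by machine and R version.

```r
n <- 20000  # arbitrary size chosen for illustration

# Grows the output with c() on every iteration
grow <- function(n) {
  out <- c()
  for (i in seq_len(n)) out <- c(out, i * 2L)
  out
}

# Allocates the output once, then fills it by index
prealloc <- function(n) {
  out <- integer(n)
  for (i in seq_len(n)) out[i] <- i * 2L
  out
}

time_grow     <- system.time(res_grow <- grow(n))["elapsed"]
time_prealloc <- system.time(res_pre  <- prealloc(n))["elapsed"]

identical(res_grow, res_pre)   # both approaches give the same result
c(grow = unname(time_grow), prealloc = unname(time_prealloc))
```

Both functions return the same vector; only the elapsed time differs, and the gap should widen as `n` increases.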
## A realistic corpus example {-}
Here we loop over registers, compute summary statistics for each, and collect results in a pre-allocated list:
```{r corpus-loop}
registers <- unique(corpus$register)
summaries <- vector("list", length(registers))
names(summaries) <- registers
for (reg in registers) {
subset_df <- corpus[corpus$register == reg, ]
summaries[[reg]] <- data.frame(
register = reg,
n_docs = nrow(subset_df),
mean_tok = round(mean(subset_df$n_tokens), 1),
sd_tok = round(sd(subset_df$n_tokens), 2),
min_tok = min(subset_df$n_tokens),
max_tok = max(subset_df$n_tokens)
)
}
do.call(rbind, summaries) |>
flextable() |>
flextable::set_table_properties(width = .85, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 12) |>
flextable::fontsize(size = 12, part = "header") |>
flextable::align_text_col(align = "center") |>
flextable::set_caption(caption = "Token statistics per register computed with a for loop.") |>
flextable::border_outer()
```
## Looping over files {-}
One of the most practical uses of `for` loops in corpus linguistics is processing many text files in a directory:
```{r file-loop, eval=FALSE}
txt_files <- list.files(path = "data/corpus/",
pattern = "\\.txt$",
full.names = TRUE)
results <- data.frame(
filename = character(length(txt_files)),
n_chars = integer(length(txt_files)),
n_lines = integer(length(txt_files)),
stringsAsFactors = FALSE
)
for (i in seq_along(txt_files)) {
text <- readLines(txt_files[i], warn = FALSE)
results$filename[i] <- basename(txt_files[i])
results$n_chars[i] <- sum(nchar(text))
results$n_lines[i] <- length(text)
}
head(results)
```
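The chunk above assumes a `data/corpus/` folder that may not exist on your machine. To try the same pattern without any external data, the sketch below first writes three tiny files into a temporary directory and then loops over them. The file names and contents are made up for illustration.

```r
# Create a throwaway directory with three small text files (hypothetical data)
corpus_dir <- file.path(tempdir(), "demo_corpus")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines(c("First line.", "Second line."), file.path(corpus_dir, "a.txt"))
writeLines("Only one line here.",            file.path(corpus_dir, "b.txt"))
writeLines(c("One.", "Two.", "Three."),      file.path(corpus_dir, "c.txt"))

txt_files <- list.files(corpus_dir, pattern = "\\.txt$", full.names = TRUE)

# Pre-allocate one row per file
demo_results <- data.frame(
  filename = character(length(txt_files)),
  n_chars  = integer(length(txt_files)),
  n_lines  = integer(length(txt_files))
)

for (i in seq_along(txt_files)) {
  text <- readLines(txt_files[i], warn = FALSE)
  demo_results$filename[i] <- basename(txt_files[i])
  demo_results$n_chars[i]  <- sum(nchar(text))
  demo_results$n_lines[i]  <- length(text)
}

demo_results
unlink(corpus_dir, recursive = TRUE)  # clean up the temporary files
```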
## `break` and `next` {-}
Two special keywords control loop flow. `break` exits the loop immediately; `next` skips to the next iteration:
```{r break-next}
# next: skip documents with fewer than 13 tokens
cat("Long documents only:\n")
for (i in seq_len(nrow(corpus))) {
if (corpus$n_tokens[i] < 13) next
cat(" ", corpus$doc_id[i], "-", corpus$n_tokens[i], "tokens\n")
}
# break: stop at the first Academic document
cat("\nFirst Academic document:\n")
for (i in seq_len(nrow(corpus))) {
if (corpus$register[i] == "Academic") {
cat(" ", corpus$doc_id[i], ":", substr(corpus$text[i], 1, 50), "...\n")
break
}
}
```
## Nested `for` loops {-}
Loops can be nested — the inner loop runs completely for each iteration of the outer loop:
```{r nested}
registers <- unique(corpus$register)
eras <- unique(corpus$era)
cat("Documents per register x era:\n")
for (reg in registers) {
for (era in eras) {
n <- sum(corpus$register == reg & corpus$era == era)
cat(sprintf(" %-10s x %-7s : %d\n", reg, era, n))
}
}
```
::: {.callout-warning}
## Loops are often not the best tool
Before writing a loop, ask: does a vectorised function or `dplyr` verb already do this? Most vectorised functions in R are implemented in compiled code (C or Fortran) and typically run far faster than an equivalent R-level loop.
```{r vectorised-alt}
# Instead of a loop for character counts:
nchar(corpus$text) # nchar() is already vectorised
# Instead of a loop for per-register summaries:
corpus |>
dplyr::group_by(register) |>
dplyr::summarise(mean_tok = mean(n_tokens), .groups = "drop")
```
Loops shine when each iteration depends on the result of the previous one, when you are reading or writing files, or when no vectorised alternative exists.
:::
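As an illustration of the first case, where each iteration depends on the previous one, here is a minimal sketch of simple exponential smoothing over the token counts. The smoothing weight `alpha` is an arbitrary value chosen for the example, and the counts are copied from `corpus$n_tokens` so the block runs on its own.

```r
# Token counts in document order (same values as corpus$n_tokens)
counts <- c(11, 10, 12, 11, 14, 16, 13, 14, 17, 15, 18, 16)
alpha  <- 0.5   # smoothing weight, chosen arbitrarily

smoothed <- numeric(length(counts))
smoothed[1] <- counts[1]

for (i in seq_along(counts)[-1]) {
  # each value depends on the one computed in the previous iteration
  smoothed[i] <- alpha * counts[i] + (1 - alpha) * smoothed[i - 1]
}

round(smoothed, 2)
```

Because `smoothed[i]` needs `smoothed[i - 1]`, the computation is naturally sequential, which is exactly the situation where an explicit loop is a reasonable choice.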
---
::: {.callout-tip}
## Exercises: `for` Loops
:::
**Q1. Why should you pre-allocate your output vector before a `for` loop rather than growing it with `c()` inside the loop?**
```{r}
#| echo: false
#| label: "FOR_Q1"
check_question("Growing with c() forces R to copy the entire vector on every iteration, which becomes extremely slow for large loops; pre-allocation assigns memory once",
options = c(
"Growing with c() forces R to copy the entire vector on every iteration, which becomes extremely slow for large loops; pre-allocation assigns memory once",
"c() inside a loop always produces incorrect results",
"Pre-allocation is only needed for loops with more than 1000 iterations",
"There is no practical difference — both approaches have the same performance"
),
type = "radio",
q_id = "FOR_Q1",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! Each call to `c(results, new_value)` creates a brand new vector by copying all existing elements plus the new one. After n iterations, R has made n copies of progressively larger vectors, so the total work is O(n²) instead of O(n). The gap widens quickly as n grows: with hundreds of thousands of iterations it can mean seconds or minutes rather than milliseconds. Pre-allocating with `integer(n)` or `vector('list', n)` reserves the memory once and simply fills it by index.",
wrong = "Think about what `c(existing_vector, new_value)` does each time it runs. If `existing_vector` already has 999 elements, what does R have to do to add one more?")
```
**Q2. What does `next` do inside a `for` loop?**
```{r}
#| echo: false
#| label: "FOR_Q2"
check_question("It skips the remainder of the current iteration and moves to the next one",
options = c(
"It skips the remainder of the current iteration and moves to the next one",
"It exits the loop entirely",
"It repeats the current iteration from the beginning",
"It pauses execution and waits for user input"
),
type = "radio",
q_id = "FOR_Q2",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! `next` is equivalent to `continue` in Python, Java, or C. When R encounters `next`, it jumps immediately to the start of the next iteration without executing any remaining code in the loop body. Use it to skip rows that fail a quality check, skip files that cannot be read, or skip elements that do not meet a processing condition. `break` is the companion keyword that exits the loop entirely.",
wrong = "There are two loop-control keywords: one skips to the next iteration, one exits the loop. Which is which?")
```
**Q3. Why is `seq_along(x)` preferred over `1:length(x)` when looping over a vector `x`?**
```{r}
#| echo: false
#| label: "FOR_Q3"
check_question("seq_along(x) returns an empty sequence when x has length 0, while 1:length(x) returns c(1, 0) and causes the loop to run twice with wrong indices",
options = c(
"seq_along(x) returns an empty sequence when x has length 0, while 1:length(x) returns c(1, 0) and causes the loop to run twice with wrong indices",
"seq_along() is faster than the : operator for all vector lengths",
"seq_along() automatically handles character and factor vectors; 1:length(x) only works for numerics",
"There is no practical difference — they are fully interchangeable"
),
type = "radio",
q_id = "FOR_Q3",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! Try it: `1:length(character(0))` returns `c(1, 0)` — a two-element vector — because `length(character(0))` is 0 and `1:0` counts down. This causes a loop meant to do nothing to run twice with nonsensical indices. `seq_along(character(0))` correctly returns `integer(0)` — an empty sequence — so the loop body never executes.",
wrong = "Try mentally evaluating both expressions when x is an empty vector. What does 1:length(x) produce? What does seq_along(x) produce?")
```
---
# `while` Loops {#whileloops}
::: {.callout-note}
## Section Overview
**What you will learn:** How to write loops that run until a condition changes rather than for a fixed number of iterations.
**Key concepts:** Loop condition, infinite loops, `break` as a safety exit
**When to use:** Convergence algorithms, reading data streams, retrying failed operations
:::
A `while` loop runs its body **as long as its condition remains `TRUE`**. Use it when the number of iterations is not known in advance.
```{r while-basic}
token_counts <- corpus$n_tokens
total <- 0
i <- 0
while (total < 50 && i < length(token_counts)) { # also stop at the end of the vector
i <- i + 1
total <- total + token_counts[i]
}
cat("Reached", total, "tokens after", i, "documents.\n")
```
## A text-processing example {-}
Here we simulate reading tokens from a stream until we hit a sentence boundary:
```{r while-text}
tokens <- c("The", "quick", "brown", "fox", "jumps", ".", "Over", "the", "lazy")
sentence <- character(0)
j <- 0
while (j < length(tokens)) {
j <- j + 1
current <- tokens[j]
sentence <- c(sentence, current)
if (grepl("\\.$", current)) break
}
cat("First sentence:", paste(sentence, collapse = " "), "\n")
```
## Avoiding infinite loops {-}
A `while` loop runs forever if its condition never becomes `FALSE`. Always ensure the loop body modifies the condition variable, and include a maximum iteration counter as a safety exit:
```{r while-safe}
max_iter <- 1000
iter <- 0
value <- 100
while (value > 1 && iter < max_iter) {
value <- value * 0.9
iter <- iter + 1
}
cat("Converged to", round(value, 4), "after", iter, "iterations.\n")
```
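The same safety-counter pattern applies to retrying an operation that may fail. The sketch below is deliberately simplified: instead of a real file read or web request, a hypothetical `outcomes` vector stands in for whether each successive attempt succeeded.

```r
# Hypothetical outcomes of successive attempts: two failures, then success
outcomes     <- c(FALSE, FALSE, TRUE, TRUE)
max_attempts <- length(outcomes)

attempt <- 0
success <- FALSE

while (!success && attempt < max_attempts) {
  attempt <- attempt + 1
  success <- outcomes[attempt]   # stands in for "did the operation work?"
}

if (success) {
  cat("Succeeded on attempt", attempt, "\n")
} else {
  cat("Gave up after", attempt, "attempts.\n")
}
```

The loop ends either when an attempt succeeds or when `max_attempts` is reached, so it cannot run forever even if every attempt fails.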
::: {.callout-warning}
## If your R session freezes
If you accidentally create an infinite loop, press **Escape** in the Console, or click the **Stop** button (red square) in the Console toolbar. RStudio will interrupt the running code. If that fails, use **Session → Interrupt R** from the menu.
:::
---
::: {.callout-tip}
## Exercises: `while` Loops
:::
**Q1. When is a `while` loop more appropriate than a `for` loop?**
```{r}
#| echo: false
#| label: "WHILE_Q1"
check_question("When the number of iterations is not known in advance and depends on a condition that changes during execution",
options = c(
"When the number of iterations is not known in advance and depends on a condition that changes during execution",
"When you need to iterate over every element in a fixed-length vector",
"When you need faster execution than a for loop provides",
"When you want to loop over the rows of a data frame"
),
type = "radio",
q_id = "WHILE_Q1",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! Use `for` when you know exactly how many iterations are needed. Use `while` when the stopping point is determined dynamically during execution — keep going until convergence, until end of file, until a retry succeeds. In linguistics, `while` is useful for processing token streams, simulating iterative processes, or implementing search algorithms.",
wrong = "Think about what distinguishes the two loops: one iterates a known number of times over a predefined sequence; the other continues until something changes.")
```
**Q2. What is the risk of writing `while (TRUE) { ... }` without a `break` statement inside the body?**
```{r}
#| echo: false
#| label: "WHILE_Q2"
check_question("The loop runs forever — an infinite loop that freezes the R session until interrupted",
options = c(
"The loop runs forever — an infinite loop that freezes the R session until interrupted",
"R automatically stops the loop after 1000 iterations",
"The loop runs exactly once because TRUE evaluates to 1",
"R throws an error immediately because TRUE is not a valid condition"
),
type = "radio",
q_id = "WHILE_Q2",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! `while (TRUE)` creates a loop whose condition is always `TRUE` and therefore never naturally terminates. Without a `break` statement inside the body, it runs indefinitely. Press Escape or click the Stop button in RStudio to interrupt.",
wrong = "What happens to a condition that is always TRUE? Does `while` have any built-in safety limit?")
```
---
# Writing Functions {#functions}
::: {.callout-note}
## Section Overview
**What you will learn:** How to write your own reusable functions — the single most important skill for writing clean, maintainable R code.
**Key concepts:** Function definition, arguments, default values, return values, scope, documentation
**Why it matters:** Functions eliminate copy-paste errors, make your intentions explicit, and make code testable and shareable. If you have written the same block of code more than twice, it should be a function.
:::
## Anatomy of a function {-}
```{r fn-anatomy}
# Template:
# my_function <- function(required_arg, optional_arg = default_value) {
# # body
# return(result) # optional: last expression is returned automatically
# }
greet_language <- function(language) {
paste("Hello from", language, "linguistics!")
}
greet_language("computational")
greet_language("corpus")
```
## Required and optional arguments {-}
Arguments without a default are **required** — omitting them raises an error. Arguments with a default are **optional** and use their default when not supplied:
```{r fn-args}
ttr <- function(tokens, lowercase = TRUE) {
if (lowercase) tokens <- tolower(tokens)
n_tokens <- length(tokens)
n_types <- length(unique(tokens))
n_types / n_tokens
}
sample_tokens <- c("The", "cat", "sat", "on", "the", "mat", "the", "cat")
ttr(sample_tokens) # lowercase = TRUE (default)
ttr(sample_tokens, lowercase = FALSE) # case-sensitive
```
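To see what happens when a required argument is left out, the following sketch wraps the call in `tryCatch()` so the error message can be inspected without stopping the script. The `ttr()` definition is repeated so the block is self-contained, and `toks` is just a local copy of the sample tokens.

```r
# Same definition as above, repeated so the example runs on its own
ttr <- function(tokens, lowercase = TRUE) {
  if (lowercase) tokens <- tolower(tokens)
  length(unique(tokens)) / length(tokens)
}

# Omitting the required argument raises an error when it is first used
msg <- tryCatch(
  ttr(),
  error = function(e) conditionMessage(e)
)
msg

# Arguments can be matched by position or by name
toks <- c("The", "cat", "sat", "on", "the", "mat", "the", "cat")
ttr(toks, FALSE)
ttr(lowercase = FALSE, tokens = toks)
```

Naming arguments explicitly makes calls easier to read and lets you supply them in any order.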
## Return values {-}
A function automatically returns its last evaluated expression. Use `return()` explicitly for early exits when input validation requires it:
```{r fn-return}
safe_ttr <- function(tokens, lowercase = TRUE) {
if (length(tokens) == 0) {
warning("Empty token vector supplied — returning NA.")
return(NA_real_)
}
if (!is.character(tokens)) {
stop("tokens must be a character vector.")
}
if (lowercase) tokens <- tolower(tokens)
length(unique(tokens)) / length(tokens)
}
safe_ttr(character(0)) # triggers warning, returns NA
safe_ttr(sample_tokens) # normal use
```
## Returning multiple values {-}
Functions can return only one object, but that object can be a **named list** containing as many results as needed:
```{r fn-multireturn}
corpus_stats <- function(tokens, lowercase = TRUE) {
if (lowercase) tokens <- tolower(tokens)
list(
n_tokens = length(tokens),
n_types = length(unique(tokens)),
ttr = round(length(unique(tokens)) / length(tokens), 3),
longest = tokens[which.max(nchar(tokens))]
)
}
result <- corpus_stats(sample_tokens)
result$ttr
result$longest
str(result)
```
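A practical advantage of returning a named list is that results from several calls can be stacked into a table. The sketch below repeats `corpus_stats()` so it is self-contained and applies it to two made-up token vectors (`docs` is a hypothetical object, not part of the tutorial corpus).

```r
corpus_stats <- function(tokens, lowercase = TRUE) {
  if (lowercase) tokens <- tolower(tokens)
  list(
    n_tokens = length(tokens),
    n_types  = length(unique(tokens)),
    ttr      = round(length(unique(tokens)) / length(tokens), 3),
    longest  = tokens[which.max(nchar(tokens))]
  )
}

# Hypothetical token vectors for two documents
docs <- list(
  doc1 = c("the", "cat", "sat", "on", "the", "mat"),
  doc2 = c("to", "be", "or", "not", "to", "be")
)

# Pre-allocate a list, convert each result to a one-row data frame
stats_list <- vector("list", length(docs))
names(stats_list) <- names(docs)
for (d in names(docs)) {
  stats_list[[d]] <- as.data.frame(corpus_stats(docs[[d]]))
}

stats_df <- do.call(rbind, stats_list)
stats_df
```

Each list element becomes a column, so the named list doubles as a row specification for the summary table.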
## Function scope {-}
Variables created inside a function **live only inside that function** — they are invisible to the global environment and cannot accidentally overwrite your workspace objects:
```{r fn-scope}
cleanup_text <- function(text) {
cleaned <- gsub("[[:punct:]]", "", text) # local to the function
cleaned <- trimws(tolower(cleaned))
cleaned
}
raw_text <- " Hello, World! "
cleaned_text <- cleanup_text(raw_text)
cleaned_text # the returned value
exists("cleaned") # FALSE — 'cleaned' never escaped the function
```
::: {.callout-tip}
## The `<<-` operator
If you need to modify a variable that lives outside the function (rare), use `<<-`. It searches the enclosing environments (where the function was defined, not where it was called), modifies the first variable of that name it finds, and creates the variable in the global environment if none exists. This is considered bad practice in most data analysis code because it creates hidden side effects that make functions unpredictable. Prefer returning a value and assigning it explicitly.
:::
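One of the few places where `<<-` is commonly seen is a closure that keeps private state. The following is a minimal sketch with hypothetical names (`make_counter()`, `n_calls`): the inner function updates a variable in the environment where it was defined, and that variable still does not appear in the global workspace.

```r
make_counter <- function() {
  n_calls <- 0                   # lives in make_counter()'s environment
  function() {
    n_calls <<- n_calls + 1      # updates n_calls in the enclosing environment
    n_calls
  }
}

count_docs <- make_counter()
count_docs()
count_docs()
third <- count_docs()
third

# n_calls was never created in the global environment
exists("n_calls", envir = globalenv(), inherits = FALSE)
```

Even here the side effect is contained inside the closure, which is why this use is generally seen as more acceptable than modifying workspace objects from within a function.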
## Documenting functions {-}
Good functions should be documented so you and colleagues can understand them months later. The conventional format mirrors the `roxygen2` package style:
```{r fn-docs}
#' Compute Type-Token Ratio
#'
#' @description
#' Calculates the type-token ratio (TTR) of a character vector of tokens.
#' TTR = number of unique word types / total number of tokens.
#'
#' @param tokens A character vector of tokens (words).
#' @param lowercase Logical. If TRUE (default), tokens are lowercased before
#' counting, so "The" and "the" count as the same type.
#'
#' @return A single numeric value between 0 and 1. Values closer to 1
#' indicate higher lexical diversity.
#'
#' @examples
#' ttr(c("the", "cat", "sat", "on", "the", "mat"))
#' ttr(c("The", "Cat", "sat"), lowercase = FALSE)
ttr <- function(tokens, lowercase = TRUE) {
if (length(tokens) == 0) return(NA_real_)
if (lowercase) tokens <- tolower(tokens)
length(unique(tokens)) / length(tokens)
}
```
## Building a reusable text-cleaning pipeline {-}
Here is a realistic example: a family of small, focused functions composed into a pipeline:
```{r fn-pipeline}
normalise_text <- function(text) {
text <- tolower(trimws(text))
gsub("\\s+", " ", text)
}
remove_punct <- function(text) {
gsub("[[:punct:]]", "", text)
}
tokenise <- function(text) {
strsplit(text, "\\s+")[[1]]
}
remove_stopwords <- function(tokens,
stopwords = c("the","a","an","of","in","and","to","is")) {
tokens[!tokens %in% stopwords]
}
clean_and_tokenise <- function(text, stopwords = NULL) {
text <- normalise_text(text)
text <- remove_punct(text)
tokens <- tokenise(text)
if (!is.null(stopwords)) tokens <- remove_stopwords(tokens, stopwords)
tokens
}
# Apply to a single document
example_text <- "The syntactic properties of embedded clauses remain poorly understood."
clean_and_tokenise(example_text,
stopwords = c("the","a","an","of","in","and","to","is"))
# Apply to all documents in the corpus
sw <- c("the","a","an","of","in","and","to","is")
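# USE.NAMES = FALSE stops sapply() from attaching the full texts as element names
corpus$content_tokens <- sapply(corpus$text, \(t) length(clean_and_tokenise(t, sw)), USE.NAMES = FALSE)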
head(corpus[, c("doc_id", "register", "n_tokens", "content_tokens")])
```
---
::: {.callout-tip}
## Exercises: Writing Functions
:::
**Q1. A function has no explicit `return()` statement. What does it return?**
```{r}
#| echo: false
#| label: "FN_Q1"
check_question("The value of the last evaluated expression in the function body",
options = c(
"The value of the last evaluated expression in the function body",
"NULL — you must always use return() to get a value back",
"The first argument passed to the function",
"An error — return() is mandatory in R"
),
type = "radio",
q_id = "FN_Q1",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! R uses implicit return: the last evaluated expression is automatically returned. This is why `add <- function(x, y) x + y` works perfectly without `return(x + y)`. Explicit `return()` is most useful when you need to exit early based on a condition.",
wrong = "R has implicit return behaviour — think about what happens at the console when you type an expression. Functions work the same way.")
```
**Q2. You write `x <- 99` inside a function body. After calling the function, does `x` exist in the global environment?**
```{r}
#| echo: false
#| label: "FN_Q2"
check_question("No — variables assigned with <- inside a function exist only in that function's local environment and disappear when the function finishes",
options = c(
"No — variables assigned with <- inside a function exist only in that function's local environment and disappear when the function finishes",
"Yes — all assignments inside functions are automatically global",
"Only if x was already defined before calling the function",
"Only if x is passed as an argument to the function"
),
type = "radio",
q_id = "FN_Q2",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! Each function call creates its own local environment. Variables assigned with `<-` are created there and discarded when the function exits. This prevents functions from accidentally modifying your data and means you do not need to worry about name clashes with workspace objects.",
wrong = "Where does a variable 'live' when it is created inside a function? Does it automatically escape into the global workspace?")
```
**Q3. Your function computes three things: n_tokens, n_types, and TTR. What is the best way to return all three?**
```{r}
#| echo: false
#| label: "FN_Q3"
check_question("Return a named list: list(n_tokens = ..., n_types = ..., ttr = ...)",
options = c(
"Return a named list: list(n_tokens = ..., n_types = ..., ttr = ...)",
"Use three separate return() calls: return(n_tokens); return(n_types); return(ttr)",
"Assign them to global variables with <<- before returning",
"Return a character string that concatenates all three values"
),
type = "radio",
q_id = "FN_Q3",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! A function can only return one object, but that object can be a named list. The caller then accesses individual results with `$`: `result$ttr`. Multiple `return()` calls do not work — the first one exits the function immediately. Assigning to globals with `<<-` creates hidden side effects and is bad practice.",
wrong = "A function can only return *one* object. But what kind of object can hold multiple values of different types?")
```
---
# The `apply` Family {#apply}
::: {.callout-note}
## Section Overview
**What you will learn:** How to apply a function to every element of a vector or list without writing an explicit loop.
**Key functions:** `sapply()`, `lapply()`, `apply()`
**Why it matters:** The `apply` family is more concise than loops and expresses intent clearly — "apply this function to each element of this object."
:::
## `sapply()` — simplified apply {-}
`sapply()` applies a function to each element of a vector or list and **simplifies the result** to a vector or matrix if possible:
```{r sapply}
# Character count for each document
nchar_results <- sapply(corpus$text, nchar)
head(nchar_results)
# Type-token ratio across a list of token vectors
token_lists <- list(
doc1 = c("the", "cat", "sat", "on", "the", "mat"),
doc2 = c("a", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"),
doc3 = c("to", "be", "or", "not", "to", "be")
)
sapply(token_lists, ttr)
```
Use an anonymous function for more complex operations:
```{r sapply-anon}
# Count content words (longer than 3 characters) per document
sapply(corpus$text, \(t) {
tokens <- strsplit(tolower(t), "\\s+")[[1]]
sum(nchar(tokens) > 3)
})
```
::: {.callout-tip}
## The `\(x)` shorthand (R 4.1+)
Since R 4.1 you can write anonymous functions more concisely using the backslash lambda:
```r
# Old style
sapply(x, function(t) nchar(t))
# New shorthand
sapply(x, \(t) nchar(t))
```
:::
## `lapply()` — list apply {-}
`lapply()` always returns a **list**, making it safer when results have different lengths or types:
```{r lapply}
# Unique words per document (different lengths per document)
unique_words <- lapply(token_lists, unique)
unique_words
```
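To see why the guaranteed list matters, the following sketch runs `sapply()` twice on the same input. The object `toy_tokens` is a made-up example, not part of the tutorial corpus.

```r
# Hypothetical token vectors (illustration only)
toy_tokens <- list(
  a = c("the", "cat", "the"),
  b = c("to", "be", "or", "not", "to", "be")
)

# Every result has length 1, so sapply() simplifies to a named integer vector
n_types <- sapply(toy_tokens, \(t) length(unique(t)))
is.list(n_types)   # FALSE

# Results differ in length, so sapply() silently falls back to a list
types_s <- sapply(toy_tokens, unique)
is.list(types_s)   # TRUE

# lapply() returns a list in both cases, so downstream code can rely on it
types_l <- lapply(toy_tokens, unique)
identical(types_s, types_l)   # TRUE
```

Because the return type of `sapply()` depends on the data, `lapply()` is usually the more predictable choice inside functions and scripts that run unattended.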
## `apply()` — matrix / data frame apply {-}
`apply()` operates on **matrices** (a data frame is first converted to a matrix, so it should contain columns of a single type), applying a function across rows (`MARGIN = 1`) or columns (`MARGIN = 2`):
```{r apply}
features <- matrix(
c(corpus$n_tokens, corpus$content_tokens),
ncol = 2,
dimnames = list(corpus$doc_id, c("n_tokens", "content_tokens"))
)
apply(features, MARGIN = 2, FUN = mean) # column means
head(apply(features, MARGIN = 1, FUN = sum)) # row sums
```
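The effect of `MARGIN` is easiest to check on a matrix small enough to verify by hand. The sketch below uses a made-up matrix, `toy_features`, rather than the tutorial's `corpus` object.

```r
# Hypothetical feature matrix: 3 documents x 2 counts (illustration only)
toy_features <- matrix(
  c(10, 20, 30,   # n_tokens column
     4,  8, 12),  # content_tokens column
  ncol = 2,
  dimnames = list(c("d1", "d2", "d3"), c("n_tokens", "content_tokens"))
)

col_means <- apply(toy_features, MARGIN = 2, FUN = mean)  # one value per column
row_sums  <- apply(toy_features, MARGIN = 1, FUN = sum)   # one value per row

col_means
row_sums

# For plain means and sums, the dedicated base functions give the same values
all.equal(col_means, colMeans(toy_features))
all.equal(row_sums, rowSums(toy_features))
```

`apply()` becomes more useful when the function applied to each row or column is one you have written yourself and no dedicated helper exists.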
## Choosing between `sapply()`, `lapply()`, and `apply()` {-}
```{r apply-choice, echo=FALSE, message=FALSE, warning=FALSE}
data.frame(
Function = c("sapply()", "lapply()", "apply()"),
Input = c("vector or list", "vector or list", "matrix or data frame"),
Output = c("vector/matrix (simplified) or list if simplification fails",
"always a list",
"vector, matrix/array, or list"),
Use_when = c("results are all the same type and length",
"results differ in length or type; you always want a list",
"you want to summarise across rows or columns of a matrix")
) |>
dplyr::rename("Use when" = Use_when) |>
flextable() |>
flextable::set_table_properties(width = .99, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 11) |>
flextable::fontsize(size = 11, part = "header") |>
flextable::align_text_col(align = "left") |>
flextable::set_caption(caption = "Choosing the right apply function.") |>
flextable::border_outer()
```
---
# Functional Programming with `purrr` {#purrr}
::: {.callout-note}
## Section Overview
**What you will learn:** How to use `purrr::map()` and its variants as a modern, consistent alternative to the `apply` family.
**Key functions:** `map()`, `map_chr()`, `map_dbl()`, `map_df()`, `map2()`, `walk()`
**Why it matters:** `purrr` functions have consistent, predictable behaviour and integrate cleanly with `dplyr` pipelines.
:::
The `purrr` package provides a family of `map()` functions that replace the `apply` family with a more consistent interface. Every `map()` function takes a list or vector and applies a function to each element.
## `map()` and type-specific variants {-}
`map()` always returns a list. Type-specific variants guarantee a particular output type and fail informatively if the results do not match:
```{r purrr-map}
map(token_lists, length) # list
map_dbl(token_lists, ttr) # numeric vector
map_int(token_lists, length) # integer vector
map_chr(token_lists, \(t) paste(t[1:2], collapse = " ")) # character vector
```
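What "fail informatively" looks like in practice can be shown with a deliberately mismatched call. This is a minimal sketch with a made-up input, `toy_tokens`; the exact wording of the error message depends on the installed purrr version, so it is not reproduced here.

```r
library(purrr)

# Hypothetical input (illustration only)
toy_tokens <- list(doc1 = c("the", "cat"), doc2 = c("a", "dog", "ran"))

# length() returns an integer, so map_int() succeeds
map_int(toy_tokens, length)

# The function returns character strings, so map_dbl() refuses to continue;
# tryCatch() is used here only to display the outcome without stopping the script
outcome <- tryCatch(
  map_dbl(toy_tokens, \(t) t[1]),
  error = function(e) "map_dbl() raised an error"
)
outcome
```

An error at this point is usually preferable to a silently coerced result, because it flags the mismatch where it occurs.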
## `map_df()` — map to a data frame {-}
`map_df()` applies a function that returns a data frame to each element and binds the results together:
```{r purrr-mapdf}
get_doc_stats <- function(tokens, doc_name) {
data.frame(
doc = doc_name,
n_tokens = length(tokens),
n_types = length(unique(tokens)),
ttr = round(ttr(tokens), 3)
)
}
map_df(names(token_lists), \(nm) get_doc_stats(token_lists[[nm]], nm))
```
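In purrr 1.0.0 and later, `map_df()` is marked as superseded: it still works, but the package documentation recommends `map()` followed by `list_rbind()`. The sketch below shows that pattern, assuming purrr 1.0.0 or newer and using a made-up helper, `toy_stats()`, in place of `get_doc_stats()`.

```r
library(purrr)

# Hypothetical input and helper (illustration only)
toy_tokens <- list(doc1 = c("the", "cat", "the"), doc2 = c("a", "dog"))

toy_stats <- function(tokens, doc_name) {
  data.frame(
    doc      = doc_name,
    n_tokens = length(tokens),
    n_types  = length(unique(tokens))
  )
}

# imap() passes each element and its name; list_rbind() stacks the data frames
stats <- imap(toy_tokens, toy_stats) |> list_rbind()
stats
```

Using `imap()` here avoids indexing the list by name inside the anonymous function, but mapping over `names()` works just as well.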
## `map2()` — map over two inputs simultaneously {-}
`map2()` applies a function to corresponding elements of two vectors or lists. Like `map()`, it has typed variants; the example uses `map2_dbl()` to return a numeric vector:
```{r purrr-map2}
doc_ids <- c("text_A", "text_B", "text_C")
texts <- list(
c("the", "cat", "sat"),
c("a", "dog", "ran"),
c("one", "two", "three", "one")
)
map2_dbl(texts, doc_ids, \(tokens, id) {
cat(id, ": TTR =", round(ttr(tokens), 3), "\n")
ttr(tokens)
})
```
## `walk()` — map for side effects {-}
`walk()` is like `map()` but used when you want the **side effect** (printing, writing a file, making a plot) rather than the return value. It invisibly returns the input, enabling piping:
```{r purrr-walk}
corpus |>
dplyr::group_by(register) |>
dplyr::group_split() |>
walk(\(df) {
cat(sprintf("Register: %-10s | Docs: %d | Mean tokens: %.1f\n",
df$register[1], nrow(df), mean(df$n_tokens)))
})
```
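That `walk()` discards the function's return values and hands back its input can be checked directly. A minimal sketch with a made-up vector, `toy_ids`:

```r
library(purrr)

# Hypothetical input (illustration only)
toy_ids <- c("text_A", "text_B")

# The function's return value ("ignored") is discarded;
# walk() hands back its input unchanged
returned <- walk(toy_ids, \(id) {
  cat("Processing", id, "\n")
  "ignored"
})

identical(returned, toy_ids)   # TRUE
```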
---
::: {.callout-tip}
## Exercises: `apply` and `purrr`
:::
**Q1. What is the difference between `sapply()` and `lapply()`?**
```{r}
#| echo: false
#| label: "APP_Q1"
check_question("sapply() tries to simplify the result to a vector or matrix; lapply() always returns a list",
options = c(
"sapply() tries to simplify the result to a vector or matrix; lapply() always returns a list",
"lapply() is faster than sapply() for all inputs",
"sapply() works on lists; lapply() works on vectors",
"They are identical — sapply is just an alias for lapply"
),
type = "radio",
q_id = "APP_Q1",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! `sapply(x, f)` calls `lapply(x, f)` internally and then tries to simplify: if all outputs are scalars of the same type it returns a vector; if all outputs are vectors of the same length it returns a matrix; otherwise a list. `lapply()` always returns a list — prefer it when outputs may vary in length or type.",
wrong = "Both functions apply a function to each element. The key difference is in the *type of object they return*.")
```
**Q2. When would you use `purrr::walk()` instead of `purrr::map()`?**
```{r}
#| echo: false
#| label: "APP_Q2"
check_question("When you want the side effects (printing, saving files, plotting) rather than the return values",
options = c(
"When you want the side effects (printing, saving files, plotting) rather than the return values",
"When the function returns a logical vector",
"walk() is faster than map() for numeric computations",
"When iterating over a list rather than a vector"
),
type = "radio",
q_id = "APP_Q2",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! `walk()` is the appropriate choice when the purpose of iterating is a side effect — writing files, printing output, generating plots — rather than collecting return values. `walk()` invisibly returns the input (enabling piping) while discarding the function's return values.",
wrong = "Think about what the 'output' of the iteration is. Are you collecting return values to use later, or performing actions whose value lies in what they do?")
```
---
# Error Handling {#errors}
::: {.callout-note}
## Section Overview
**What you will learn:** How to write code that handles errors and warnings gracefully rather than crashing.
**Key functions:** `tryCatch()`, `try()`, `stop()`, `warning()`, `message()`
**Why it matters:** When processing many files or documents, a single error should not halt your entire pipeline.
:::
## Signalling conditions from your functions {-}
Use `stop()`, `warning()`, and `message()` to communicate problems from inside your functions:
```{r conditions}
compute_ttr <- function(tokens) {
  # Invalid input: halt with an error
  if (!is.character(tokens)) stop("tokens must be a character vector")
  # Empty input: warn and return NA straight away
  if (length(tokens) == 0) {
    warning("Empty vector: returning NA")
    return(NA_real_)
  }
  # Short input: inform the user but carry on
  if (length(tokens) < 10) message("Note: TTR is unreliable for short texts")
  length(unique(tokens)) / length(tokens)
}
compute_ttr(c("the", "cat", "sat")) # triggers message: short text
```
The three signals have different effects on execution: `stop()` halts immediately; `warning()` signals a problem but continues; `message()` prints an informational note and continues.
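The difference can be observed directly. The sketch below uses two made-up helpers, `toy_signal()` and `toy_stop()`, which exist only for this illustration.

```r
# Hypothetical helper that signals a message and a warning, then returns a value
toy_signal <- function() {
  message("starting")              # printed, execution continues
  warning("something looks odd")   # recorded, execution continues
  "finished"
}

# Both conditions are silenced here just to keep the output clean
value <- suppressWarnings(suppressMessages(toy_signal()))
value   # "finished"

# stop() abandons the function: the line after it never runs
toy_stop <- function() {
  stop("invalid input")
  "never reached"
}
inherits(try(toy_stop(), silent = TRUE), "try-error")   # TRUE
```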
## `tryCatch()` — handle errors gracefully {-}
`tryCatch()` intercepts errors, warnings, and messages, letting you decide what to do instead of crashing:
```{r trycatch}
safe_ttr <- function(tokens) {
tryCatch(
expr = compute_ttr(tokens),
error = function(e) {
cat("Error in compute_ttr:", conditionMessage(e), "\n")
NA_real_
},
warning = function(w) {
cat("Warning:", conditionMessage(w), "\n")
NA_real_
}
)
}
safe_ttr(c("the", "cat", "sat", "on", "the", "mat")) # normal
safe_ttr(123) # wrong type → error caught
safe_ttr(character(0)) # empty → warning caught
```
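`try()` is a lighter-weight alternative when all you need is to keep going after an error. It returns either the result or an object of class `"try-error"`. The sketch below uses a made-up stand-in, `toy_ttr()`, so that it runs on its own.

```r
# Hypothetical stand-in for compute_ttr() (illustration only)
toy_ttr <- function(tokens) {
  if (!is.character(tokens)) stop("tokens must be a character vector")
  length(unique(tokens)) / length(tokens)
}

ok  <- try(toy_ttr(c("to", "be", "or", "not", "to", "be")), silent = TRUE)
bad <- try(toy_ttr(123), silent = TRUE)

ok                           # 4 unique types / 6 tokens
inherits(bad, "try-error")   # TRUE: the error was captured, not raised

# Replace a failed result with NA before carrying on
if (inherits(bad, "try-error")) bad <- NA_real_
```

`tryCatch()` is the better fit when errors and warnings need different treatment. `try()` is enough for a quick "attempt this and check afterwards".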
## Applying `tryCatch()` across a pipeline {-}
This pattern is invaluable when processing many documents — one bad item should not stop the whole run:
```{r trycatch-pipeline}
inputs <- list(
c("the", "cat", "sat", "on", "the", "mat"),
123, # will error
character(0), # will warn
c("a", "quick", "fox")
)
results <- sapply(inputs, safe_ttr)
results
```
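In a longer run it often helps to record which items failed and why, not only to replace them with `NA`. One possible way to do this is sketched below. The names `toy_ttr()`, `toy_inputs`, and `outcomes` are made up for the illustration.

```r
# Hypothetical stand-in for compute_ttr() (illustration only)
toy_ttr <- function(tokens) {
  if (!is.character(tokens)) stop("tokens must be a character vector")
  length(unique(tokens)) / length(tokens)
}

toy_inputs <- list(good = c("a", "b", "a"), bad = 123, fine = c("x", "y"))

# Return the value and the error text together, one small list per input
outcomes <- lapply(toy_inputs, \(x) {
  tryCatch(
    list(ttr = toy_ttr(x), problem = NA_character_),
    error = function(e) list(ttr = NA_real_, problem = conditionMessage(e))
  )
})

ttrs     <- sapply(outcomes, \(o) o$ttr)
problems <- sapply(outcomes, \(o) o$problem)

ttrs                        # NA marks the failed item
problems[!is.na(problems)]  # which inputs failed, and why
```

Keeping the error text next to each result makes it possible to review all failures after the run instead of scanning console output.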
---
::: {.callout-tip}
## Exercises: Error Handling
:::
**Q1. What is the difference between `stop()`, `warning()`, and `message()` inside a function?**
```{r}
#| echo: false
#| label: "ERR_Q1"
check_question("stop() halts execution with an error; warning() continues but signals a potential problem; message() prints an informational note without affecting control flow",
options = c(
"stop() halts execution with an error; warning() continues but signals a potential problem; message() prints an informational note without affecting control flow",
"All three halt execution — they only differ in the colour of the output",
"stop() is for syntax errors; warning() is for runtime errors; message() is for logic errors",
"warning() and message() are identical; stop() is the only one that affects execution"
),
type = "radio",
q_id = "ERR_Q1",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! `stop()` raises an error condition that immediately halts execution unless caught by `tryCatch()`. `warning()` raises a warning condition: R continues execution and, by default, reports the collected warnings once the top-level call has finished. `message()` signals a message condition, which is printed to stderr by default, does not interrupt execution, and can be silenced with `suppressMessages()`. Use `stop()` for invalid inputs; `warning()` for valid but suspicious inputs; `message()` for progress updates.",
wrong = "Think about what happens to program execution after each: does it continue, pause, or stop?")
```
**Q2. Why is wrapping a function call in `tryCatch()` useful when processing a large number of files or documents?**
```{r}
#| echo: false
#| label: "ERR_Q2"
check_question("It prevents one failed item from crashing the entire pipeline — errors are caught and handled gracefully, and processing continues with the remaining items",
options = c(
"It prevents one failed item from crashing the entire pipeline — errors are caught and handled gracefully, and processing continues with the remaining items",
"tryCatch() makes the function run faster by skipping validation",
"It automatically fixes the error and retries the operation",
"It is only useful for file I/O operations, not for statistical functions"
),
type = "radio",
q_id = "ERR_Q2",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! Without error handling, a single corrupt file or one row with missing data can crash an entire overnight batch job. `tryCatch()` lets you define what to do when something goes wrong — log the error, return `NA`, and continue with the next item. This is essential for robust corpus processing pipelines.",
wrong = "Imagine you are processing 5,000 text files and one of them is corrupt. Without error handling, what happens to the other 4,999?")
```
---
# Best Practices {#bestpractice}
::: {.callout-note}
## Section Overview
**A concise guide to writing better R code:** when to loop, when to vectorise, how to name and document functions, and the DRY principle.
:::
## When to use each construct {-}
```{r when-table, echo=FALSE, message=FALSE, warning=FALSE}
data.frame(
Situation = c(
"Apply the same operation to every element of a vector",
"Apply the same operation to each group in a data frame",
"Apply a function to each element and collect results",
"Iterate when each step depends on the previous result",
"Number of iterations unknown; stop when condition met",
"Apply a function for its side effects (print, save, plot)",
"Handle different cases of a single categorical variable"
),
Best_tool = c(
"Vectorised operation (e.g. nchar(), tolower(), arithmetic)",
"dplyr::group_by() + summarise() or mutate()",
"sapply() / lapply() / purrr::map()",
"for loop with pre-allocated output",
"while loop (with break safety exit)",
"purrr::walk() or a for loop",
"ifelse() / case_when() / switch()"
)
) |>
dplyr::rename("Best tool" = Best_tool) |>
flextable() |>
flextable::set_table_properties(width = .99, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 11) |>
flextable::fontsize(size = 11, part = "header") |>
flextable::align_text_col(align = "left") |>
flextable::set_caption(caption = "Choosing the right construct for the task.") |>
flextable::border_outer()
```
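The first row of the table is the one most often overlooked: many base R functions already operate on whole vectors, so no loop or `sapply()` call is needed. A minimal sketch with made-up texts:

```r
# Hypothetical texts (illustration only)
toy_texts <- c("The cat sat", "A dog ran home", "Birds sing")

# Loop version: pre-allocate, then fill element by element
n_chars_loop <- integer(length(toy_texts))
for (i in seq_along(toy_texts)) {
  n_chars_loop[i] <- nchar(toy_texts[i])
}

# Vectorised version: nchar() already works on the whole vector
n_chars_vec <- nchar(toy_texts)

identical(n_chars_loop, n_chars_vec)   # TRUE
```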
## Writing good functions {-}
- **Do one thing well** — a function named `clean_tokenise_count_and_plot()` is a sign it should be four functions
- **Name with verbs** — `clean_text()`, `compute_ttr()`, `plot_frequency()`, not `myFunc()` or `data2()`
- **Validate inputs early** — use `stop()` at the top of the function body for invalid arguments
- **Provide useful defaults** — optional arguments should default to the most common use case
- **Document** — at minimum a comment above explaining what the function does, its arguments, and what it returns
- **Test** — call your function with expected inputs, edge cases (empty vector, single element, `NA`), and invalid inputs
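
The points above can be seen together in one small function. `count_long_tokens()` is a hypothetical helper written for this illustration. It is one way to apply the checklist, not the only one.

```r
# count_long_tokens(): count tokens longer than a minimum length.
#   tokens  - character vector of tokens
#   min_len - tokens with more than this many characters are counted (default 3)
# Returns a single integer.
# (Hypothetical helper for illustration; not defined elsewhere in the tutorial.)
count_long_tokens <- function(tokens, min_len = 3) {
  # Validate inputs early
  if (!is.character(tokens)) stop("tokens must be a character vector")
  if (!is.numeric(min_len) || length(min_len) != 1) {
    stop("min_len must be a single number")
  }
  # NA tokens are dropped rather than counted
  tokens <- tokens[!is.na(tokens)]
  sum(nchar(tokens) > min_len)
}

# Test with an expected input, edge cases, and an invalid input
count_long_tokens(c("the", "quick", "brown", "fox"))   # 2
count_long_tokens(character(0))                        # 0
count_long_tokens(c("lazy", NA))                       # 1
inherits(try(count_long_tokens(42), silent = TRUE), "try-error")   # TRUE
```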
## Code style {-}
```r
# Good: clear structure, consistent indentation, descriptive names
compute_register_stats <- function(data, group_col = "register") {
data |>
dplyr::group_by(.data[[group_col]]) |>
dplyr::summarise(
n = dplyr::n(),
mean_tok = round(mean(n_tokens), 1),
sd_tok = round(sd(n_tokens), 2),
.groups = "drop"
)
}
# Bad: cryptic names, no whitespace, no structure
f<-function(d,g="register"){d%>%group_by(.data[[g]])%>%summarise(n=n(),m=round(mean(n_tokens),1))}
```
## The DRY principle {-}
**Don't Repeat Yourself.** If you find yourself copy-pasting a block of code and changing one value, that block should be a function parameterised by that value. Code duplication multiplies the places you must update when requirements change and multiplies the opportunities for inconsistency.
```r
# Before: copy-pasted three times with minor changes
academic_ttr <- ...
news_ttr <- ...
fiction_ttr <- ...
# After: one function, called three times
get_register_ttr <- function(data, reg) { ... }
sapply(c("Academic", "News", "Fiction"), get_register_ttr, data = corpus)
```
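The outline above leaves the function body open. A runnable version of the same idea is sketched below on a made-up data frame, `toy_corpus`, with a hypothetical `toy_register_ttr()`. The pooling of all texts in a register before computing the TTR is an assumption made for this example.

```r
# Hypothetical mini-corpus (illustration only)
toy_corpus <- data.frame(
  register = c("Academic", "Academic", "News", "Fiction"),
  text     = c("the data show the effect", "results vary",
               "the mayor said the vote", "she ran and ran"),
  stringsAsFactors = FALSE
)

# One function, parameterised by the value that used to be edited by hand
toy_register_ttr <- function(data, reg) {
  texts  <- data$text[data$register == reg]
  tokens <- unlist(strsplit(tolower(texts), "\\s+"))
  length(unique(tokens)) / length(tokens)
}

sapply(c("Academic", "News", "Fiction"), toy_register_ttr, data = toy_corpus)
```

If the tokenisation rule changes later, it now has to be updated in one place only.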
---
# Citation & Session Info {.unnumbered}
::: {.callout-note}
## Citation
```{r citation-callout, echo=FALSE, results='asis'}
cat(
params$author, ". ",
params$year, ". *",
params$title, "*. ",
params$institution, ". ",
"url: ", params$url, " ",
"(Version ", params$version, "), ",
"doi: ", params$doi, ".",
sep = ""
)
```
```{r citation-bibtex, echo=FALSE, results='asis'}
key <- paste0(
tolower(gsub(" ", "", gsub(",.*", "", params$author))),
params$year,
tolower(gsub("[^a-zA-Z]", "", strsplit(params$title, " ")[[1]][1]))
)
cat("```\n")
cat("@manual{", key, ",\n", sep = "")
cat(" author = {", params$author, "},\n", sep = "")
cat(" title = {", params$title, "},\n", sep = "")
cat(" year = {", params$year, "},\n", sep = "")
cat(" note = {", params$url, "},\n", sep = "")
cat(" organization = {", params$institution, "},\n", sep = "")
cat(" edition = {", params$version, "},\n", sep = "")
cat(" doi = {", params$doi, "}\n", sep = "")
cat("}\n```\n")
```
:::
```{r fin}
sessionInfo()
```
::: {.callout-note}
## AI Transparency Statement
This tutorial was re-developed with the assistance of **Claude** (claude.ai), a large language model created by Anthropic. Claude was used to help revise the tutorial text, structure the instructional content, generate the R code examples, and write the `checkdown` quiz questions and feedback strings. All content was reviewed, edited, and approved by the author (Martin Schweinberger), who takes full responsibility for the accuracy and pedagogical appropriateness of the material. The use of AI assistance is disclosed here in the interest of transparency and in accordance with emerging best practices for AI-assisted academic content creation.
:::
[Back to top](#intro)
[Back to HOME](/index.html)
# References {.unnumbered}